2 HDFS files comparison - CSV

I have 6000+ .csv files in /hadoop/hdfs/location1 and 6100+ .csv files in /hadoop/hdfs/location2.
I want to compare these two HDFS directories and find the files that differ. The differing (non-identical) .csv files should be copied to a third HDFS directory (/hadoop/hdfs/location3). I am not sure whether the Unix diff command can be used on the HDFS file system.
Any idea on how to resolve this would be appreciated.
Anshul

You could use a Python (or Perl, etc.) script to check this. Depending on your exact needs and how fast it has to be, you could check the file size first. Are the filenames identical? Are the creation dates identical, etc.?
If you want to use Python, check out the filecmp module.
>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst')
True
>>> filecmp.cmp('undoc.rst', 'index.rst')
False
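Note that filecmp only works on local paths, so for HDFS you would have to script around the hdfs CLI instead. A rough sketch of the same name/size check, assuming the hdfs command is on the PATH and using the location1/location2 paths from the question (everything else is a placeholder, not tested against a live cluster):
import subprocess

def hdfs_listing(path):
    """Return {file name: size in bytes} parsed from 'hdfs dfs -ls <path>'."""
    out = subprocess.check_output(["hdfs", "dfs", "-ls", path], text=True)
    files = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 8 or not parts[-1].endswith(".csv"):
            continue  # skip the "Found N items" header and non-CSV entries
        files[parts[-1].rsplit("/", 1)[-1]] = int(parts[4])  # field 4 is the size
    return files

loc1 = hdfs_listing("/hadoop/hdfs/location1")
loc2 = hdfs_listing("/hadoop/hdfs/location2")

# Different = present on only one side, or present on both sides with different sizes.
diff = set(loc1) ^ set(loc2)
diff |= {name for name in set(loc1) & set(loc2) if loc1[name] != loc2[name]}
print(sorted(diff))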

Look at the post below, which provides an answer on how to compare two HDFS files. You will need to extend this to two folders.
HDFS File Comparison
You could easily do this with the Java API and create a small app:
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// compare the checksum values with equals(), not ==
return chksum1.equals(chksum2);
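If you would rather not write Java, the same checksum idea can be scripted against the hdfs CLI with 'hdfs dfs -checksum'. A minimal sketch (the paths are placeholders, and note that HDFS checksums are only comparable when both files were written with the same block size and checksum settings):
import subprocess

def hdfs_checksum(path):
    # 'hdfs dfs -checksum <path>' prints: <path> <algorithm> <checksum>
    out = subprocess.check_output(["hdfs", "dfs", "-checksum", path], text=True)
    return out.split()[-1]

def same_content(path1, path2):
    return hdfs_checksum(path1) == hdfs_checksum(path2)

# e.g. same_content("/hadoop/hdfs/location1/a.csv", "/hadoop/hdfs/location2/a.csv")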

We don't have HDFS commands to compare files.
Check the post below; this can be achieved by writing a Pig program or a MapReduce program.
Equivalent of linux 'diff' in Apache Pig

I think the steps below will solve your problem:
Get the list of file names in the first location into one file.
Get the list of file names in the second location into another file.
Find the diff between the two files using Unix commands.
Whatever diff files you find, copy those files to the other location (a small sketch tying these steps together follows below).
I hope this helps; otherwise let me know.
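As a rough illustration of those steps in one script, treating a file as "diff" when its name exists in location1 but not in location2 (combine this with a size or checksum check if same-named files can have different contents), and assuming the hdfs CLI is available:
import subprocess

def list_names(path):
    # -C prints just the paths (on older Hadoop versions, parse the full -ls output instead)
    out = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", path], text=True)
    return {p.rsplit("/", 1)[-1] for p in out.splitlines() if p.endswith(".csv")}

loc1 = "/hadoop/hdfs/location1"
loc2 = "/hadoop/hdfs/location2"
loc3 = "/hadoop/hdfs/location3"

for name in sorted(list_names(loc1) - list_names(loc2)):
    subprocess.check_call(["hdfs", "dfs", "-cp", loc1 + "/" + name, loc3 + "/" + name])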

Related

Copying multiple files from one folder to another in the same S3 bucket

I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has hardly 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have the following -
folder a - inside this folder, 10,000 JSON files
folder b - inside this folder, 10,000 JSON files
My aim - target_folder - dev-bucket/final/ with 20,000 JSON files.
I tried writing the code below, but the processing time is huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)  # final_file and file_count are defined earlier (omitted here)
        new_obj.copy(old_source)
except Exception as e:
    logger.error("The process has failed to copy files from sftp location to base location: %s", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # list.append returns None, so don't reassign response
Can you please guide me? Many thanks in advance.
If you only need to copy once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that the files contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
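If you do stay with the plain copy approach, a rough sketch of the inventory-driven idea could look like the following. The bucket name, inventory file name, target prefix and CSV column layout are assumptions here, not something given in the question, and the copies run through a small thread pool so they don't go one object at a time:
import csv
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
BUCKET = "dev-bucket"        # placeholder
TARGET_PREFIX = "final/"     # placeholder

def copy_one(key):
    # Flatten: keep only the file name, drop the original folder structure.
    target_key = TARGET_PREFIX + key.rsplit("/", 1)[-1]
    s3.copy_object(Bucket=BUCKET, Key=target_key,
                   CopySource={"Bucket": BUCKET, "Key": key})

# An S3 Inventory CSV row typically starts with: bucket, key (URL-encoded), ...
with open("inventory.csv", newline="") as f:   # placeholder file name
    keys = [unquote_plus(row[1]) for row in csv.reader(f) if len(row) > 1]

with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(copy_one, keys))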

Export from Couchbase to CSV file

I have a Couchbase Cluster with only one node (let's call it localhost) and I need to export all the data from a very big bucket (let's call it XXX) into a CSV file.
Now this seems to be a pretty easy task but I can't find the way to make it work.
According to the (really bad) documentation on the cbtransfer tool from Couchbase http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html, they say this is possible but they don't explain it clearly. They just mention a flag if you want the transfer to occur in CSV format (?), but it is not working. Maybe someone who has already done this can give me a hand?
Using the documentation I've been able to make an approach to the result I want to obtain (a clean CSV file with all the documents in the XXX bucket) using this command:
/opt/couchbase/bin/cbtransfer http://localhost:8091 /path/to/export/output.csv -b XXX
But what I get is that /path/to/export/output.csv is actually a folder with a lot of folders inside and it is storing some kind of json metadata that can be used to restore the XXX bucket in another instance of Couchbase.
Has anyone been able to export data from a Couchbase bucket (Json documents) into a CSV file?
From looking at the documentation, you have to use a slightly different syntax to export to a CSV. http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html
It needs to look like so:
cbtransfer http://[localhost]:8091 csv:./data.csv -b default -u Administrator -p password
Notice the "csv:" before the name of the csv file.
I tested this and it does export a CSV. Just be forewarned that you need a relatively flat document structure for this to work really well, as JSON can obviously represent far more complex data structures than CSV, e.g. arrays, sub-documents, etc., and cbtransfer will not unravel those. For example, if there is a sub-document, cbtransfer will represent it as a JSON doc in that line of the CSV.
So depending on what your document structure is, exporting to CSV may not be an ideal format; it is a step backwards.

How to restore a MySQL database if the file is in .gz format and has multiple parts?

I need a little help here with restoring a MySQL database. My boss gave me a backup of a database which is in .gz format and has 7 parts. I extracted one of the .gz files, but it doesn't work like .rar parts, where extracting the first .rar extracts the others and makes a whole file. In my case, extracting the first .gz only extracted that one part; it doesn't extract the others. How can I solve this problem? By the way, after extracting the first part I got a file named 'db.backup.1', and when I opened it in a text editor it showed my database and the tables inside it. I extracted the others as well and they are all the same. How can I merge them? Each file has about 125 MB of data.
Unzip (gunzip) them first.
Merge them using the append command on Linux:
cat file1 >> masterfile.sql
and so on:
cat file2 >> masterfile.sql
Then import masterfile.sql into the desired database (e.g. mysql -u user -p database < masterfile.sql).
Windows version - from the command line, run:
copy file1+file2+file3 targetfile
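If you would rather script the decompress-and-append steps than run them by hand, a minimal Python sketch of the same thing (the part names are an assumption based on the question) would be:
import glob
import gzip
import shutil

# Decompress each part and append it to one masterfile.sql, i.e. gunzip + cat >>.
# Watch the sort order if there are more than 9 parts (e.g. .10 sorts before .2).
with open("masterfile.sql", "wb") as master:
    for part in sorted(glob.glob("db.backup.*.gz")):
        with gzip.open(part, "rb") as gz:
            shutil.copyfileobj(gz, master)
# then: mysql -u user -p database < masterfile.sql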

Make a searchable volume in mysql

I need to put the contents of a Volume in a mysql database, to be searchable via a web interface.
To get all the files/folders, I can do:
$ cd /Volumes/myVolume
$ find ./
Which will give me all I need to know.
If my MySQL table only has one column called path, what would be the most efficient way to write all the paths to the table, given there are 1M+ paths?
Pipe the output of the script above to a file and then import the file.
Run the following:
find ./ > directorylisting.txt
Open the file and see how to import it into MySQL using one of the many import options available. The link daniph mentioned in the comment on your question has some pointers. You can use mysqlimport or a LOAD DATA INFILE statement (e.g. LOAD DATA INFILE '/path/to/directorylisting.txt' INTO TABLE your_table (path);) to load this file into the table. Index the column properly and you should be well away.

CSVDE export file-column order wrong?

I'm using CSVDE to export data from our active directory into a CSV file, which then gets imported into a database. I'm using the -l switch to specify the columns that I'd like to export, but they don't come out in the same order consistently. Is there a workaround for this that doesn't involve opening the file in Excel? This is a nightly batch process and we'd like it to run unattended.
Thanks!
If you simply want a command-line utility that can re-order the CSV (and do much else as well), take a look at my FOSS CSV stream editor, CSVfix.
Per the docs:
LDAP can return attributes in any order, and csvde does not attempt to impose any order on the columns.
How about writing a Python script to read and reorder the CSV file? You may find the Python csv module useful for this.
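For example, a small sketch with the csv module that rewrites the export with its columns in a fixed order (the attribute names and file names below are placeholders for whatever you pass to the -l switch):
import csv

WANTED_ORDER = ["sAMAccountName", "givenName", "sn", "mail"]  # placeholder attributes

with open("export.csv", newline="") as src, open("export_ordered.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)   # uses the header row CSVDE wrote
    writer = csv.DictWriter(dst, fieldnames=WANTED_ORDER, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)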