How to rename my hadoop result into a file with ".csv" extension - csv

Actually my intention is to rename the output of a Hadoop job to .csv files, because I need to visualize this CSV data in RapidMiner.
In "How can i output hadoop result in csv format" it is said that for this purpose I need to follow these three steps:
1. Submit the MapReduce job.
2. Extract the output from HDFS using shell commands.
3. Merge the part files together, rename the result as ".csv", and place it in a directory where the visualization tool can access the final file.
If so, how can I achieve this?
UPDATE
myjob.sh:
bin/hadoop jar /var/root/ALA/ala_jar/clsperformance.jar ala.clsperf.ClsPerf /user/root/ala_xmlrpt/Amrita\ Vidyalayam\,\ Karwar_Class\ 1\ B_ENG.xml /user/root/ala_xmlrpt-outputshell4
bin/hadoop fs -get /user/root/ala_xmlrpt-outputshell4/part-r-00000 /Users/jobsubmit
cat /Users/jobsubmit/part-r-00000 /Users/jobsubmit/output.csv
showing:
The CSV file was empty and couldn’t be imported.
when I tried to open output.csv.
Solution:
cat /Users/jobsubmit/part-r-00000 > /Users/jobsubmit/output.csv
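For reference, a corrected version of the whole script might look like the following; the paths are the ones from the question, and the comments are only a sketch of what each step does:
# myjob.sh -- run the job, pull the reducer output out of HDFS, and expose it as .csv
bin/hadoop jar /var/root/ALA/ala_jar/clsperformance.jar ala.clsperf.ClsPerf /user/root/ala_xmlrpt/Amrita\ Vidyalayam\,\ Karwar_Class\ 1\ B_ENG.xml /user/root/ala_xmlrpt-outputshell4
# copy the reducer output to the local file system
bin/hadoop fs -get /user/root/ala_xmlrpt-outputshell4/part-r-00000 /Users/jobsubmit
# the ">" redirection was what was missing originally, so output.csv was never written
cat /Users/jobsubmit/part-r-00000 > /Users/jobsubmit/output.csv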

First you need to retrieve the MapReduce result from HDFS:
hadoop dfs -copyToLocal path_to_result/part-r-* local_path
Then cat the parts into a single file:
cat local_path/part-r-* > result.csv
What happens next depends on your MapReduce result format: if it is already CSV, you are done. If not, you will probably have to use another tool such as sed or awk to transform it into CSV format.
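For instance, if the reducers write the default TextOutputFormat (tab-separated key/value lines), a small awk pass can turn the merged parts into CSV; this is only a sketch, using the placeholder paths from above:
# merge locally and convert tab-separated key/value pairs into comma-separated lines
awk 'BEGIN { FS = "\t"; OFS = "," } { print $1, $2 }' local_path/part-r-* > result.csv
Alternatively, hadoop fs -getmerge path_to_result local_path/result.csv pulls and concatenates all the part files in one step; any sed/awk cleanup would then run on that single file.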

Related

How to import csv files into hbase table in a different schema

I've recently started working on HBase and don't know much about it. I have multiple CSV files (around 20000) and I want to import them into an HBase table in such a way that each file becomes a row in HBase, with the name of the file as the row key. That means each row of a CSV file is a cell in HBase, and I need to put them into a struct datatype (of 25 fields). Unfortunately, I have no clue how to approach the problem. If anyone would be kind enough to give me some tips to get started, I'd appreciate it.
Here is a sample of the csv files:
time, a, b, c, d, ..., x
0.000,98.600,115.700,54.200,72.900,...,0.000
60.000,80.100,113.200,54.500,72.100,...,0.000
120.000,80.000,114.200,55.200,72.900,...,0.000
180.000,80.000,118.400,56.800,75.500,...,0.000
240.000,80.000,123.100,59.600,79.200,...,0.000
300.000,80.000,130.100,61.600,82.500,...,0.000
Thanks,
Importtsv is a utility that will load data in TSV or CSV format into HBase.
Importtsv has two distinct usages:
1. Loading data from TSV or CSV format in HDFS into HBase via Puts.
2. Preparing StoreFiles to be loaded via completebulkload (a rough sketch of this second usage follows below).
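The bulk-load variant could look roughly like this; the table name mytable, the column list and the HDFS paths are illustrative and not taken from the question, and the LoadIncrementalHFiles class name follows the HBase 1.x documentation:
# step 1: run ImportTsv in bulk mode -- it writes HFiles instead of issuing Puts
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 -Dimporttsv.bulk.output=/tmp/hfiles mytable /input_csv_dir
# step 2: hand the generated StoreFiles over to the table (the "completebulkload" step)
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable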
Load data from TSV or CSV format in HDFS into HBase
Below is an example that loads data from an HDFS file into an HBase table. You must copy the local file into an HDFS folder first; only then can you load it into the HBase table (a sketch of these preliminary steps follows the command below).
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age personal /test
The above command launches a MapReduce job that loads the data from the CSV file into the HBase table.
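The preliminary steps might look like this; the table name personal, column family personal_data and HDFS path /test match the command above, while data.csv is a made-up local file name:
# create the target table with the column family referenced in -Dimporttsv.columns
echo "create 'personal', 'personal_data'" | hbase shell
# copy the local CSV into the HDFS directory that ImportTsv will read
hdfs dfs -mkdir -p /test
hdfs dfs -put data.csv /test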

NiFi merge CSV files using MergeRecord

I have a stream of JSON records that I successfully convert into CSV records with this instruction. Now I want to merge these CSV records into one CSV file. Below is that flow:
At step 5 I end up with around 9K CSV records; how do I merge them into one CSV file using the MergeRecord processor?
My CSV header:
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11
Some of these fields may be null and vary between records.
After this, use UpdateAttribute and configure it so that the merged file gets a proper filename, and then use PutFile to store it in a specific location.
I had a similar problem and solved it by using the RouteOnAttribute processor. Hope this helps someone.
Below is how I configured the processor, using ${merge.count:equals(1)}.
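For context, a possible MergeRecord configuration for collapsing the ~9K records into a single flowfile might look roughly like the following; the reader/writer service names and all thresholds are illustrative assumptions, not settings taken from the question:
Record Reader: CSVReader (configured with the 11-field header above)
Record Writer: CSVRecordSetWriter
Merge Strategy: Bin-Packing Algorithm
Minimum Number of Records: 10000 (large enough to hold all ~9K records in one bin)
Maximum Number of Records: 100000
Max Bin Age: 5 min (flushes the bin even if the minimum is never reached)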

File not found: trying to append csv data in Stata

I have some .csv files in the same directory and I am trying to append these in Stata. But when I use append, Stata cannot find the next file. My code is the following:
cd "C:\mydir"
insheet using "file1.csv", clear
append using "file2.csv"
With the last line, I obtain the following error:
file file2.csv not found
I have more expertise with R and I know this procedure is similar to rbind.
You can't append a .csv file to a Stata dataset produced by insheet. Save the .csv files as Stata datasets first, insheet the last one, and then append the saved Stata files to it (a minimal sketch follows).
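A minimal sketch of that workaround, using the two file names from the question (the .dta file name is arbitrary):
cd "C:\mydir"
* read file2.csv and save it as a Stata dataset
insheet using "file2.csv", clear
save "file2.dta", replace
* read the remaining .csv, then append the saved Stata file to it
insheet using "file1.csv", clear
append using "file2.dta"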

AWS download all Spark "part-*" files and merge them into a single local file

I've run a Spark job via Databricks on AWS, and by calling
big_old_rdd.saveAsTextFile("path/to/my_file.json")
I have saved the results of my job into an S3 bucket on AWS. The result of that Spark command is a directory path/to/my_file.json containing portions of the result:
_SUCCESS
part-00000
part-00001
part-00002
and so on. I can copy those part files to my local machine using the AWS CLI with a relatively simple command:
aws s3 cp s3://my_bucket/path/to/my_file.json local_dir --recursive
and now I've got all those part-* files locally. Then I can get a single file with
cat $(ls part-*) > result.json
The problem is that this two-stage process is cumbersome and leaves file parts all over the place. I'd like to find a single command that will download and merge the files (ideally in order). When dealing with HDFS directly this is something like hadoop fs -cat "path/to/my_file.json/*" > result.json.
I've looked around through the AWS CLI documentation but haven't found an option to merge the file parts automatically, or to cat the files. I'd be interested in either some fancy tool in the AWS API or some bash magic that will combine the above commands.
Note: Saving the result into a single file via spark is not a viable option as this requires coalescing the data to a single partition during the job. Having multiple part files on AWS is fine, if not desirable. But when I download a local copy, I'd like to merge.
This can be done with a relatively simple function using boto3, the AWS python SDK.
The solution involves listing the part-* objects in a given key, and then downloading each of them and appending to a file object. First, to list the part files in path/to/my_file.json in the bucket my_bucket:
import boto3
bucket = boto3.resource('s3').Bucket('my_bucket')
keys = [obj.key for obj in bucket.objects.filter(Prefix='path/to/my_file.json/part-')]
Then, use Bucket.download_fileobj() with a file opened in append mode to write each of the parts. The function I'm now using, with a few other bells and whistles, is:
from os.path import basename
import boto3

def download_parts(base_object, bucket_name, output_name=None, limit_parts=0):
    """Download all file parts into a single local file."""
    base_object = base_object.rstrip('/')
    bucket = boto3.resource('s3').Bucket(bucket_name)
    prefix = '{}/part-'.format(base_object)
    output_name = output_name or basename(base_object)
    # open in append mode so each part is written after the previous one
    with open(output_name, 'ab') as outfile:
        # number the parts from 1 so the messages and the limit check line up
        for i, obj in enumerate(bucket.objects.filter(Prefix=prefix), 1):
            bucket.download_fileobj(obj.key, outfile)
            if limit_parts and i >= limit_parts:
                print('Terminating download after {} parts.'.format(i))
                break
        else:
            print('Download completed after {} parts.'.format(i))
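Calling it with the bucket and prefix from the question would then be something like download_parts('path/to/my_file.json', 'my_bucket'), which writes the merged parts to a local file named my_file.json.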
Downloading the parts may take an extra line of code, but as far as cat'ing them in order goes, you can do it by time created or alphabetically.
Combined in order of time created: cat $(ls -t part-*) > outputfile
Combined & sorted alphabetically: cat $(ls part-* | sort) > outputfile
Combined & sorted reverse-alphabetically: cat $(ls part-* | sort -r) > outputfile
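If you would rather stay in the shell for the whole thing, the two stages can also be collapsed into one loop. This is only a sketch, assuming the bucket and prefix from the question and that the part objects sort correctly by key name:
# stream each part object straight into the local result file, in key order
aws s3 ls s3://my_bucket/path/to/my_file.json/ | awk '{print $NF}' | grep '^part-' | sort | while read -r key; do
  aws s3 cp "s3://my_bucket/path/to/my_file.json/$key" - >> result.json
done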

Plink: export subset of data into txt or csv

I performed a GWAS in PLINK and now I would like to look at the data for a small set of SNPs, listed one per line in a file called snps.txt.
I would like to export the data from PLINK for these specific SNPs into a .txt or .csv file. Ideally, this file would have the individual IDs as well as the genotypes for these SNPs, so that I could later merge it with my phenotype file and perform additional analyses and plots.
Is there an easy way to do that? I know I can use --extract to request specific SNPs only but I can't find a way to tell PLINK to export the data to an "exportable" text-based format.
If you are using classic plink (1.07) you should consider upgrading to plink 1.9. It is a lot faster, and supports many more formats. This answer is for plink 1.9.
Turning binary plink data into a .csv file
It sounds like your problem is that you are unable to turn the binary data into a regular plink text file.
This is easy to do with the recode option. It should be used without any parameters to convert to the plink text format:
plink --bfile gwas_file --recode --extract snps.txt --out gwas_file_text
If you want to convert the .ped data to a csv afterwards you could do the following:
cut -d " " -f2,7- --output-delimiter=, gwas_file_text.ped > gwas_file_text.csv
This produces a comma-delimited file with the individual IDs in the first column, followed by the genotypes.
Turning plink data into other text based file formats
Note that you can also convert the data to a lot of other text-based filetypes, all described in the docs.
One of these is the common variant call format (VCF), which puts the SNPs and individual IDs all in one file, as requested:
plink --bfile gwas_file --recode vcf --extract snps.txt --out gwas_file_text