Hadoop: map/reduce from HDFS - configuration

I may be wrong, but all(?) examples I've seen with Apache Hadoop takes as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep)
Is there a way to load and save the data on the Hadoop file system (HDFS)? For example I put a tab delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls. How should I configure the JobConf to read it ?
Thanks .

JobConf conf = new JobConf(getConf(), ...);
...
FileInputFormat.setInputPaths(conf, new Path("stored.xls"))
...
JobClient.runJob(conf);
...
setInputPaths will do it.

Pierre, the default configuration for Hadoop is to run in local mode, rather than in distributed mode. You likely need to just modify some configuration in your hadoop-site.xml. It looks like your default filesystem is still localhost, when it should be hdfs://youraddress:yourport. Look at your setting for fs.default.name, and also see the setup help at Michael Noll's blog for more details.

FileInputFormat.setInputPaths(conf, new Path("hdfs://hostname:port/user/me/stored.xls"));
This will do

Related

Spark S3 CSV read returns org.apache.hadoop.mapred.InvalidInputException

I see several posts here and in a Google search for org.apache.hadoop.mapred.InvalidInputException
but most deal with HDFS files or trapping errors. My issue is that while I can read a CSV file from spark-shell, running it from a compiled JAR constantly returns an org.apache.hadoop.mapred.InvalidInputException error.
The rough process of the jar:
1. read from JSON documents in S3 (this works)
2. read from parquet files in S3 (this also succeeds)
3. write a result of a query against #1 and #2 to a parquet file in S3 (also succeeds)
4. read a configuration csv file from the same bucket #3 is written to. (this fails)
These are the various approaches that I have tried in code:
1. val osRDD = spark.read.option("header","true").csv("s3://bucket/path/")
2. val osRDD = spark.read.format("com.databricks.spark.csv").option("header", "true").load("s3://bucket/path/")
All variations of the two above with s3, s3a and s3n prefixes work fine from the REPL but inside a JAR they return this:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/path/eventsByOS.csv
So, it found the file but can't read it.
Thinking this was a permissions issue, I have tried:
a. export AWS_ACCESS_KEY_ID=<access key> and export AWS_SECRET_ACCESS_KEY=<secret> from the Linux prompt. With Spark 2 this has been sufficient to provide us access to the S3 folders up until now.
b. .config("fs.s3.access.key", <access>)
.config("fs.s3.secret.key", <secret>)
.config("fs.s3n.access.key", <access>)
.config("fs.s3n.secret.key", <secret>)
.config("fs.s3a.access.key", <access>)
.config("fs.s3a.secret.key", <secret>)
Before this failure, the code reads from parquet files located in the same bucket and writes parquet files to the same bucket. The CSV file is only 4.8 KB in size.
Any ideas why this is failing?
Thanks!
Adding stack trace:
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:253)
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1332)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.take(RDD.scala:1326)
org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1367)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.first(RDD.scala:1366)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:206)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
scala.Option.orElse(Option.scala:289)
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
nothing springs out when I paste that stack into the IDE, but I'm looking at a later version of Hadoop and can't currently switch to older ones.
Have a look at these instructions
That landsat gz file is actually a CSV file you can try to read in; it's the one we generally use for testing because its there and free to use. Start by seeing if you can work with it.
If using spark 2.0, use spark's own CSV package.
Do use S3a, not the others.
I solved this problem by adding the specific Hadoop configuration for the appropriate method (s3 in the example here). The odd thing is that the above security works for everything in Spark 2.0 EXCEPT reading the CSV.
This code solved my problem using S3.
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", p.aws_accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey",p.aws_secretKey)

How to pick path of file from System properties in jmeter CSV Data Set Config

I am trying to use CSV Data Set Config to get some data from csv file to be used in jmeter script but i don't want to hardcode the file path as it will be changing according to the test environment. Is there a way i can pick this path from System properties i.e some export set in my bashrc file.
Export in my bashrc :
export NIMBUS4_PERFORMANCE_TEST_REPO=/Users/rahul/Documents/verecloud/performancetest/data/user.csv
I would suggest the following workaround:
Change "Filename" setting of the CSV Data Set Config to following:
${__BeanShell(System.getenv().get("NIMBUS4_PERFORMANCE_TEST_REPO"))}
Where:
System.getenv() - method which provides access to underlying operating system environment variables
__Beanshell() - JMeter built-in function which allows executing of arbitrary Beanshell code
you could create a softlink at some static path. For example,
say we have created a soft link to /user/data/csvs folder.
You are in say ~/Documents , there run below
ln -s /user/data/csvs
Now we can access it in the jmeter and you will also have the flexibility to modify the softlink to point to some other location too.
Only constraint i see is the pointed directory name shouldn't change.
Hope this will help!!!
You can have just users.csv if the file is in the same folder as the .jmx itself;
You can have ${location}\users.csv
And in your UserDefinedVariables you'll have
and in non-gui mode you'll refer as
%RUNNER_HOME%\Test.jmx -Jloc=%RUNNER_HOME%\users.csv -Jusers=100 -Jloop=1 -Jrampup=5

neo4j LOAD CSV returns Couldn't Load external resource

Trying CSV import to Neo4j - doesn't seem to be working.
I'm loading a local file using the syntax:
LOAD CSV WITH HEADERS FROM "file:///location/local/my.csv" AS csvDoc
Am wondering if there's something wrong with my CSV file, or if there's some syntax problem here.
If you didn't read the title, the error is:
Couldn't load the external resource at: file:/location/local/my.csv
[Neo.TransientError.Statement.ExternalResourceFailure]
Neo4j seems to need a full path spec to get a file on the local system.
On linux or mac try
LOAD CSV FROM "file:/Users/you/location/local/my.csv"
On windows try
LOAD CSV FROM "file://c:/location/local/my.csv"
.
In the browser interface (Neo4j 3.0.3, MacOS 10.11) it looks like Neo4j prefixes your file path with $path_to_graph_database/import. So you could move your files there. If you are using a command line tool, then see this SO question.
Easy solution:
Once you choose your database location (in my case ReactomeGraphDB60)...
here I placed my ddbb
...go to that folder, and create inside a folder called "import".
Later in the cypher query write (as an example):
LOAD CSV WITH HEADERS FROM "file:///ILClasiffStruct.csv" AS row
CREATE (n:Interleukines)
SET n = row

2 Hdfs files comparison

I have 6000+ .csv files in /hadoop/hdfs/location1 and 6100+ .csv files in /hadoop/hdfs/location2.
I want to compare these two hdfs directories and find the diff of files. The diff .csv files(non-similar) should be reflected in a 3rd hdfs directory(/hadoop/hdfs/location3). I am not sure we can use diff command as in unix to hdfs file system.
Any idea on how to resolve this would be appreciable.
Anshul
You could use some python (perl/etc.) script to check it. Depending on your special needs and speed, you could check for file-size first. Are the filenames identical? Are the creation-dates identical etc.?
If you want to use python, check out the filecmp module.
>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst')
True
>>> filecmp.cmp('undoc.rst', 'index.rst')
False
Look at the below post which provides an answer on how to compare 2 HDFS files. You will need to extend this for 2 folders.
HDFS File Comparison
You could easily do this with the Java API and create a small app:
FileSystem fs = FileSystem.get(conf);
chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
return chksum1 == chksum2;
We don't have hdfs commands to compare the files.
Check below post we can achieve by writing the PIG Program or We need to Write Map Reduce Program.
Equivalent of linux 'diff' in Apache Pig
I think below steps will solve your problem:
Get the list of file names which are in first location into one file
Get the second location files into another file
Find the diff between two files using unix commands
Whatever the diff files you found, copy those files in the other location.
I hope this helps you. otherwise let me know.

iSeries Export to CSV

Is there an iSeries command to export the data in a table to CSV format?
I know about the Windows utilities, but since this needs to be run automatically I need to run this from a CL program.
You can use CPYTOIMPF and specify the TOSTMF option to place a CSV file on the IFS.
Example:
CPYTOIMPF FROMFILE(DBFILE) TOSTMF('/outputfile.csv') STMFCODPAG(*PCASCII) RCDDLM(*CRLF)
If you want the data to be downloaded directly to a PC, you can use the "Data Transfer from iSeries" function of IBM iSeries Client Access to create a .CSV file. In the file output details dialog, set the file type to Comma Separated Variable (CSV).
You can save the transfer description to be reused later.
You could use a trigger. The iSeries Client Access software wont do since that is a windows application, what I understand is that you need the data to be exported each time that the file is written. Check this link to know more about triggers.
You are going to need FTP to perform that action.
If your iSeries shop uses ZMOD/FTP your shortest solution is a few lines of code away -- 3 lines to be exact -- the three lines are to Start FTP, Put DBF, and finally, End FTP.
IF you don't use ZMOD/FTP:
- You could use native FTP/400 to accomplish what you need to do, but it is quite involved!!!
- you may probably need to use an RPGLE program to parse, format, and move, data into a "flatfile", then use native FTP/400 to FTP the file out
- and yes, a CL will need as a wrapper!
You can do it all in one very simple CL program:
CPYTOIMPF the file TOSTMF -> the cvs file will be in the IFS
FTP the file elsewhere (to a server or a PC)
It works like a charm