I'm new to Camel, and the lack of similar questions online leads me to believe I'm doing something silly. I am using Camel 2.12.1 components and am parsing large CSV files both from local directories and by downloading them over SFTP. I've found that
split(body().tokenize("\n")).streaming().unmarshal().csv()
works for local files (Windows 7); I get multiple exchanges, each with a
List<List<String>>
for each line in the CSV file. But when I use that same route syntax with the SFTP component (connecting to a Linux server to download the files), I get a single exchange containing a single line that reads like the output of "ls":
-rwxrwxrwx 1 userName userName 83400 Dec 16 14:11 fileName.csv
Through trial and error, I found that
split(body()).streaming().unmarshal().csv()
with the SFTP component will correctly load and parse the file, but it doesn't do it in streaming mode; it loads the entire file into memory before unmarshalling it into a single exchange.
I found a similar bug report (https://issues.apache.org/jira/browse/CAMEL-6231) from Camel 2.10 which Claus closed as Invalid, indicating the reporter was combining threads/parallel processing with streaming incorrectly, but I'm not configuring either of those capabilities.
The SFTP endpoint URI I'm using is:
sftp://192.168.1.1?fileName=fileName.csv&username=userName&password=secret!&idempotent=true&localWorkDirectory=tmp
The file endpoint URI is:
"file:test/data?noop=true&fileName=fileName.csv"
Does anyone have an idea what I'm doing wrong?
Make an intermediate route to solve the problem:
<route id="StagingFtpFileCopy">
<from uri="ftp://{{uriFtpPath}}"/>
<to uri="file://data/staging"/>
</route>
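If you prefer the Java DSL, a rough sketch of the same two-route idea (the second route and its endpoints are illustrative assumptions about how the staged file would then be consumed):

import org.apache.camel.builder.RouteBuilder;

// Stage the remote file locally, then stream-split the staged copy.
public class StagingRoutes extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("ftp://{{uriFtpPath}}")
            .routeId("StagingFtpFileCopy")
            .to("file://data/staging");

        // Illustrative consumer of the staged file; streaming works as expected here.
        from("file://data/staging")
            .split(body().tokenize("\n")).streaming()
            .unmarshal().csv()
            .to("log:stagedCsv");
    }
}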
I faced the same issue with SFTP (Camel 2.25.0). However, before splitting the route into two different routes (as proposed by others), I used the URL below
sftp://:22/?username=random&password=random&delay=2000&move=archive&readLock=changed&bridgeErrorHandler=true&recursive=false&disconnect=true&stepwise=false&streamDownload=true&localWorkDirectory=C:/temp
with the route definition below:
from("sftp url").split().tokenize("\n", 10, true).streaming().to("log:out")
Since this route also downloads the remote file locally (same as the two-route option) and then treats the local file with normal streaming (which, as Sinsanator mentioned, works perfectly with the file component), the memory footprint shows a true saw-tooth pattern while downloading (up to 100 MB) and then uses up to 150 MB while processing, again with a roughly saw-tooth shape.
One advantage (in my view) of this approach is that completion-related tasks (e.g. moving the remote file to another directory) can be handled based on actual processing completion, which is not possible automatically if we break up the routes. Also, since the download is managed by Camel, the local file gets deleted automatically once processing completes.
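Putting the URI options and the route together, a sketch of this approach (host, port, credentials and directories are placeholders; the endpoint options and the split/tokenize parameters are copied from the answer above):

import org.apache.camel.builder.RouteBuilder;

// Download over SFTP with streamDownload + localWorkDirectory (placeholder host/path),
// then stream-split the body.
public class StreamDownloadRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("sftp://myhost:22/inbox?username=random&password=random&delay=2000"
                + "&move=archive&readLock=changed&bridgeErrorHandler=true"
                + "&recursive=false&disconnect=true&stepwise=false"
                + "&streamDownload=true&localWorkDirectory=C:/temp")
            // Tokenize arguments taken verbatim from the answer's route definition.
            .split().tokenize("\n", 10, true).streaming()
            .to("log:out");
    }
}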
I used to download a node of the Firebase Realtime Database every day to monitor some outputs by exporting the JSON file for that node. The JSON file itself is about 8 MB.
Recently, I started receiving an error:
"Exporting JSON Unable to export The size of data exported at a single location cannot exceed 256 MB.Navigate to a smaller part of the database or use backups. Read more about limits"
Can someone please explain why I keep getting this error, since the JSON file I exported just yesterday was only 8.1 MB?
I probably solved it! I disabled the CORS add-on in Chrome and suddenly the export worked :)
To get around this, you can use Postman's Import feature, because downloading a large JSON file from the Firebase dashboard sometimes fails partway through in a browser. You can paste the usual cURL commands into it; you just need to click "Save Response" once the response has arrived. To avoid the authentication complexity, you can set the database rule to read: true until the download is complete, though you need to take care of security while doing this. Postman may also freeze its UI while previewing the JSON, but you don't need to be bothered by that.
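For the same idea without Postman, a minimal sketch that pulls one node over the Realtime Database REST API (the project URL and node path are placeholders, and it assumes the node is temporarily readable or that you append a valid auth token):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Download a single database node as JSON and stream it straight to disk,
// so a large export doesn't fail partway through in the browser.
public class ExportNode {
    public static void main(String[] args) throws Exception {
        // Placeholder project and node; appending ".json" invokes the REST API.
        URL url = new URL("https://your-project-id.firebaseio.com/some/node.json");
        try (InputStream in = url.openStream()) {
            Files.copy(in, Paths.get("node-export.json"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}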
I see several posts here and in a Google search for org.apache.hadoop.mapred.InvalidInputException, but most deal with HDFS files or trapping errors. My issue is that while I can read a CSV file from spark-shell, running the same code from a compiled JAR consistently returns an org.apache.hadoop.mapred.InvalidInputException error.
The rough process of the JAR:
1. read from JSON documents in S3 (this works)
2. read from parquet files in S3 (this also succeeds)
3. write a result of a query against #1 and #2 to a parquet file in S3 (also succeeds)
4. read a configuration csv file from the same bucket #3 is written to. (this fails)
These are the various approaches that I have tried in code:
1. val osRDD = spark.read.option("header","true").csv("s3://bucket/path/")
2. val osRDD = spark.read.format("com.databricks.spark.csv").option("header", "true").load("s3://bucket/path/")
All variations of the two above with s3, s3a and s3n prefixes work fine from the REPL but inside a JAR they return this:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/path/eventsByOS.csv
So, it found the file but can't read it.
Thinking this was a permissions issue, I have tried:
a. export AWS_ACCESS_KEY_ID=<access key> and export AWS_SECRET_ACCESS_KEY=<secret> from the Linux prompt. With Spark 2 this has been sufficient to provide us access to the S3 folders up until now.
b. .config("fs.s3.access.key", <access>)
.config("fs.s3.secret.key", <secret>)
.config("fs.s3n.access.key", <access>)
.config("fs.s3n.secret.key", <secret>)
.config("fs.s3a.access.key", <access>)
.config("fs.s3a.secret.key", <secret>)
Before this failure, the code reads from parquet files located in the same bucket and writes parquet files to the same bucket. The CSV file is only 4.8 KB in size.
Any ideas why this is failing?
Thanks!
Adding stack trace:
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:253)
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1332)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.take(RDD.scala:1326)
org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1367)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.first(RDD.scala:1366)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:206)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
scala.Option.orElse(Option.scala:289)
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
Nothing springs out when I paste that stack into the IDE, but I'm looking at a later version of Hadoop and can't currently switch to older ones.
Have a look at these instructions
That landsat gz file is actually a CSV file you can try to read in; it's the one we generally use for testing because it's there and free to use. Start by seeing if you can work with it.
If using Spark 2.0, use Spark's own CSV package.
Do use S3a, not the others.
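A minimal sketch of that advice (Spark 2.x, shown here with the Java API; the s3a credential keys are the standard Hadoop ones, and the input is the public landsat scene list commonly used for S3A testing, so substitute your own bucket/object as needed):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Read a CSV over s3a with Spark's built-in CSV source.
public class S3aCsvCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("S3aCsvCheck").getOrCreate();

        // Set s3a credentials on the Hadoop configuration used by the s3a connector.
        spark.sparkContext().hadoopConfiguration()
                .set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
        spark.sparkContext().hadoopConfiguration()
                .set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

        // Public test object; replace with s3a://bucket/path/eventsByOS.csv for the real case.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("s3a://landsat-pds/scene_list.gz");
        df.show(5);

        spark.stop();
    }
}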
I solved this problem by adding the specific Hadoop configuration for the appropriate scheme (s3 in the example here). The odd thing is that the credential setup above works for everything in Spark 2.0 EXCEPT reading the CSV.
This code solved my problem using S3.
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", p.aws_accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey",p.aws_secretKey)
I am trying to automate weekly generation of a database. As a first step in this process, I need to obtain a set of files from network location M:\. The process is as follows:
Delete any possibly remaining old source files from my local folder (REMOVE_OLD_FILES).
Obtain the names of the required files using regular expressions (GET_FILES).
Copy the files from the network location to my local folder for further processing (COPY/MOVE FILES).
Step 3 is where I run into trouble; I frequently receive the error below:
Error processing files. Exception : org.apache.commons.vfs.FileNotFoundException: Could not read from "file:///M:/FILESOURCE/FILENAME.zip" because it is a not a file.
However, when I manually locate the 'erroneous' file on the network location and try to open or copy it, there are no problems. If I then re-run the Spoon job, no errors occur for this file (although the next file might lead to an error).
So far, I have verified that steps 1 and 2 run correctly: more specifically, there are no errors in the file names returned from step 2.
Obviously, I would prefer not having to manually open all the files first to ensure that Spoon can correctly copy them. Does anyone have an idea what might be causing this behaviour?
For completeness, below are the parameters selected in the COPY/MOVE FILES step.
I was facing the same issue with different clients, and a basic approach finally resolved it. It might help in your case as well, and other users can follow the same approach.
Just try this: create all required folders with the Spoon job entry "Create a Folder", then deactivate or delete those hops from your job or transformation once the folders are created.
This happens because the user you are using to delete the file(s) is not recognized as a Windows user. Once your folders are in place, you can remove the "Create a Folder" steps from your job.
The path to the file is wrong. If you are running Spoon in a Windows environment you should use the Windows format for file paths. Try changing from
"file:///M:/FILESOURCE/FILENAME.zip"
To
"M:\FILESOURCE\FILENAME.zip"
By the way, this will only work if M: is an actual drive in the machine. If you want to access a file on the network, you should use the network path to the shared folder, like this:
"\\MachineName\M$\FILESOURCE\FILENAME.zip"
or
"\\MachineName\FILESOURCE\FILENAME.zip"
If you try to access a file on a network-mounted drive, it won't work.
I am getting the exception "ValueError: insecure string pickle" when attempting to run my program after creating a sandbox from MKS.
Hopefully you are still interested in helping if you are still reading this, so here's the full story.
I created an application in Python that analyzes data. When saving specific data from my program, I pickle the file. I correctly read and write it in binary and everything is working correctly on my computer.
I then used py2exe to wrap everything into an .exe. However, in order to get the pickled files to continue to work, I have to physically copy them into the folder that py2exe generates. So my pickle files sit inside the .exe's folder, and everything works correctly when I run the .exe.
Next, I upload everything to MKS (an ALM tool; here is the Wikipedia page: http://en.wikipedia.org/wiki/MKS_Integrity).
When I proceed to create a sandbox of my files and run the program, I get the dreaded "insecure string pickle" error. So I am wondering if MKS mangled something or added an end-of-line character to my pickle files. However, when I compare the contents of the MKS pickle file and the one I created before I uploaded the program to MKS, there are no differences.
I hope this is enough detail to describe my problem.
Please help!
Thanks
Have you tried adding your pickled files to your Integrity sandbox as binaries and not text?
When adding the file, on the Create Archive interface, select the options button, and change data type to "Binary" from "Auto". This will maintain any non-text formatting within the file.
:image => StorageRoom::Image.new_with_filename(path)
I have to get the path of the image. So far I have specified the path manually and it worked, but now I have deployed to Heroku and it shows a Load Error - no such file present.
How can I get the path value on the local system using the browse button?
Your problem may not be related to path names, but to the fact that Heroku has a read-only file system. If you try to write files onto disk in a Heroku app, it simply doesn't work -- the file will not be saved.
The exception is the "temp" directory. You can save files there, but they are not guaranteed to persist for longer than the duration of a single request.
Is the file you are trying to open actually saved in your Git repo? If so, it will be on the disk in your Heroku app, and you should be able to open it.
To see what the filesystem layout looks like on your Heroku instance, you can create a controller method like:
render :inline => Dir['**/*'].inspect
File.expand_path
Reference: http://saaridev.blogspot.com/2006/11/ruby-finding-absolute-path-of-running.html
You don't need the full path. As far as the file path on the client machine is concerned, for file uploads the path is irrelevant, and exposing it poses security risks for the user.
Most modern browsers don't send the file path for file uploads. You could get the path using JavaScript or Flash, but I still don't see the logic behind doing this.
When a user clicks the submit button, the browser should at least send you the file name along with the file data, together with a bunch of other information like the MIME type. Your web server would either write the file to disk or process it in memory, assuming you have near-infinite memory resources. Look at RFC 1867 (form-based file upload) for more on this.