I have an earlier question on this, Open a CSV file from S3 using Roo on Heroku, but I'm not getting any bites - so here is a reword:
I have a CSV file in an S3 bucket
I want to read it using Roo in a Heroku based app (i.e. no local file access)
How do I open the CSV file from a stream?
Or is there a better tool for doing this?
I am using Rails 4 and Ruby 2. Note that I can successfully open the CSV for reading if I post it from a form. How can I adapt this to pull the file from an S3 bucket instead?
Short answer - don't use Roo.
I ended up using the standard CSV library. For small CSV files you can simply read the file contents into memory with something like this:
require 'csv'
body = file.read
CSV.parse(body, col_sep: ",", headers: true) do |row|
  row_hash = row.to_hash
  field = row_hash["FieldName"]
  # process the field...
end
To read a file passed in from a form, just reference the params:
file = params[:file]
body = file.read
To read from S3 you can use the AWS gem:
s3 = AWS::S3.new(access_key_id: ENV['AWS_ACCESS_KEY_ID'], secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'])
bucket = s3.buckets['BUCKET_NAME']
# check each object in the bucket
bucket.objects.each do |obj|
  import_file = obj.key
  body = obj.read
  # call the same style import code as above...
end
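If you are on a newer version of the AWS SDK for Ruby (the Aws:: namespace, aws-sdk v2/v3), the equivalent read looks roughly like this - the bucket and key names are placeholders:
require 'aws-sdk-s3'
s3 = Aws::S3::Client.new(
  region: ENV['AWS_REGION'],
  access_key_id: ENV['AWS_ACCESS_KEY_ID'],
  secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
)
body = s3.get_object(bucket: 'BUCKET_NAME', key: 'path/to/import.csv').body.read
# hand body to the same CSV.parse code as above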
I put some code together based on this:
Make Remote Files Local With Ruby Tempfile
and Roo seems to work OK when handed a temp file (see the sketch below). I couldn't get it to work with S3 directly. I don't particularly like the copy approach, but my processing runs on Delayed Job, and I want to keep the Roo features a little more than I dislike the file copy. Plain CSV files work without fishing out the encoding info, but XLS files would not.
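For reference, a minimal sketch of that tempfile approach, assuming the aws-sdk v1 API used in the other answer and an illustrative bucket/key (adjust the Roo call to your Roo version):
require 'roo'
require 'tempfile'

s3 = AWS::S3.new(access_key_id: ENV['AWS_ACCESS_KEY_ID'], secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'])
obj = s3.buckets['BUCKET_NAME'].objects['path/to/import.xls']

tmp = Tempfile.new(['import', File.extname(obj.key)])
begin
  tmp.binmode
  tmp.write(obj.read)   # copy the S3 object into the local temp file
  tmp.close

  sheet = Roo::Spreadsheet.open(tmp.path)
  (sheet.first_row..sheet.last_row).each do |i|
    row = sheet.row(i)
    # process row...
  end
ensure
  tmp.unlink
end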
Related
I am trying to write a JSON file into my AWS S3 bucket. However, I do not get a JSON file after it has been uploaded.
I get my data from a website using requests.get() and format it into JSON, which I then upload to my S3 bucket with the following script:
import json
import boto3
import requests

s3_client = boto3.client('s3')

r = requests.get(url=id_url, params=params)
data = r.json()
s3_client.put_object(
    Body=json.dumps(data, indent=3),
    Bucket='bucket-name',
    Key=fileName
)
However, I am not sure what the file type is; it is supposed to be saved as a JSON file.
Screenshot of my S3 bucket showing no file type
Screenshot of my download folder, showing that the file type cannot be identified
When I open the file with PyCharm, it is just a dictionary of keys and values.
Solved it - I simply added ".JSON" to the filename and that solved the file formatting issue. Don't know why I didn't think of this earlier.
Thank you
Ideally you shouldn't rely on file extensions to specify the content type. The put_object method supports specifying ContentType. This means that you can use any file name you like, without needing to append .json.
e.g.
s3_client.put_object(
    Body=json.dumps(data, indent=3),
    Bucket='bucket-name',
    Key=fileName,
    ContentType='application/json'
)
I am trying to copy files from one folder to another. However, the source folder contains multiple folders, each with multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each of them has barely 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have the following:
folder a - contains 10000 JSON files
folder b - contains 10000 JSON files
My aim: Target_folder - dev-bucket/final/ - containing all 20000 JSON files.
I tried writing the code below; however, the processing time is huge. Is there another way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.print("The process has failed to copy files from sftp location to base location", e)
    exit(1)
I was thinking of merging the data into a single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # note: list.append returns None, so don't reassign response
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
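Note that cp --recursive keeps the relative key structure, so to flatten folders a and b into a single final/ prefix you would copy each subfolder separately; a rough example (the paths are illustrative):
aws s3 cp s3://dev-bucket/data/a/ s3://dev-bucket/final/ --recursive
aws s3 cp s3://dev-bucket/data/b/ s3://dev-bucket/final/ --recursive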
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and that you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
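If you would rather do the merge yourself in Python, building on your pagination code, a minimal sketch could look like the following (the bucket name, prefix and target key are assumptions):
import boto3

s3_client = boto3.client('s3')
s3_bucket = 'dev-bucket'   # assumed bucket name
FOLDER = 'data/'           # assumed source prefix

# collect the contents of every small object under the prefix
merged_parts = []
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER):
    for obj in page.get('Contents', []):
        body = s3_client.get_object(Bucket=s3_bucket, Key=obj['Key'])['Body']
        merged_parts.append(body.read().decode('utf-8'))

# write everything back as one newline-delimited JSON object
s3_client.put_object(
    Bucket=s3_bucket,
    Key='final/merged.json',   # hypothetical target key
    Body='\n'.join(merged_parts).encode('utf-8'),
    ContentType='application/json'
)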
How can I make JMeter read the second sheet of my CSV?
I want to use CSV Data Set Config.
Normally, it reads the first line of the first sheet but is there any way to be a bit more flexible?
The CSV file format doesn't have "sheets"; it is a plain text file that uses delimiters to represent structured data.
If you are trying to get data from, for example, a Microsoft Excel file - unfortunately you won't be able to do it using the CSV Data Set Config. The easiest option would be exporting the data as separate plain-text CSV files.
If you don't have the possibility to do the export, you can still access the data in Excel files, but it will be a little more tricky, as you will have to use JSR223 Test Elements, the Groovy language and the Apache POI libraries (see the sketch after the links below).
More information:
Busy Developers' Guide to HSSF and XSSF Features
How to Extract Data From Files With JMeter
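A minimal JSR223 (Groovy) sketch of that approach, assuming the data lives in a workbook called data.xlsx and you want its second sheet (the file name and variable names are illustrative; if the POI jars aren't already in JMETER_HOME/lib, add them there):
import org.apache.poi.ss.usermodel.WorkbookFactory

def workbook = WorkbookFactory.create(new File('data.xlsx'))
def secondSheet = workbook.getSheetAt(1)   // sheets are zero-indexed

secondSheet.each { row ->
    def values = row.collect { cell -> cell.toString() }
    log.info("Row ${row.rowNum}: ${values}")
}

// expose the first data cell of the second sheet as a JMeter variable
vars.put('secondSheetFirstCell', secondSheet.getRow(1)?.getCell(0)?.toString() ?: '')
workbook.close()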
Currently the CSV Data Set Config can't do this on its own; you need to add external code, for example using Apache Commons CSV.
Download the jar file and place it in JMeter's lib folder (JMETER_HOME/lib), then write the code in a JSR223 element.
For example, code to get the second record:
Reader in = new FileReader("path/to/file.csv");
Iterator<CSVRecord> records = CSVFormat.RFC4180.parse(in).iterator();
// skip the first record
records.next();
CSVRecord secondRecord = records.next();
// columnOne = secondRecord.get(0);
I am trying a CSV import into Neo4j and it doesn't seem to be working.
I'm loading a local file using the syntax:
LOAD CSV WITH HEADERS FROM "file:///location/local/my.csv" AS csvDoc
I am wondering if there's something wrong with my CSV file, or if there's some syntax problem here.
The error is:
Couldn't load the external resource at: file:/location/local/my.csv
[Neo.TransientError.Statement.ExternalResourceFailure]
Neo4j seems to need a full path spec to get a file on the local system.
On Linux or Mac try:
LOAD CSV FROM "file:/Users/you/location/local/my.csv"
On Windows try:
LOAD CSV FROM "file://c:/location/local/my.csv"
In the browser interface (Neo4j 3.0.3, MacOS 10.11) it looks like Neo4j prefixes your file path with $path_to_graph_database/import. So you could move your files there. If you are using a command line tool, then see this SO question.
Easy solution:
Once you choose your database location (in my case ReactomeGraphDB60), go to that folder and create a folder inside it called "import".
Then in the Cypher query write (as an example):
LOAD CSV WITH HEADERS FROM "file:///ILClasiffStruct.csv" AS row
CREATE (n:Interleukines)
SET n = row
I'm new to the whole Hadoop/Hortonworks/Pig stack, so excuse the question.
I have installed the Hortonworks Sandbox. I'm trying to load a Twitter JSON file and perform some queries on it, but I'm currently stuck on the loading part.
I know that I should use Elephant Bird in order to load a JSON file (without specifying the JSON schema) with JsonLoader(), so I've downloaded Elephant Bird from the git repo and included the jar file
Elephant-bird\repo\com\twitter\elephant-bird\2.2.3\elephant-bird-2.2.3.jar
inside the Hortonworks Sandbox. Here is my Pig script:
REGISTER elephant-bird-2.2.3.jar;
Json1 = LOAD 'JSON/sample.tweets' JsonLoader();
DESCRIBE Json1;
STORE Json1 INTO 'tweeterOutput';
Unfortunately I cannot get any results from this script execution. I've tried with both STORE and DUMP commands.
Probably I'm doing many wrong things in this process flow, so any help will be appreciated!
You are missing the USING keyword:
Json1 = LOAD 'JSON/sample.tweets' USING JsonLoader();
You also need to add a few more jars: elephant-bird-core-4.4.jar, elephant-bird-pig-4.4.jar, elephant-bird-hadoop-compat-4.4.jar, json-simple-1.1.1.jar.
Register all of them in the script:
REGISTER elephant-bird-core-4.4.jar;
REGISTER elephant-bird-pig-4.4.jar;
REGISTER elephant-bird-hadoop-compat-4.4.jar;
REGISTER json-simple-1.1.1.jar;
Json1 = LOAD 'JSON/sample.tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
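Once the load works, you can pull a field out of the resulting map and dump a sample to check it. A rough sketch, where 'text' is an assumed attribute of the tweet JSON:
-- each record is a single map; extract one attribute and inspect it
Tweets = FOREACH Json1 GENERATE (chararray) $0#'text' AS tweet_text;
DUMP Tweets;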