In Apache NiFi, I have a flow in which the flowfiles' contents are arrays of JSON objects. Each flowfile has a unique filename attribute.
// flowfile1:
filename: file1.json
[ {}, {}, {}, ... ]
// flowfile2:
filename: file2.json
[ {}, {}, {}, ... ]
Now, I want to put those files onto an FTP server if a file with the given filename does not already exist. If such a file does exist, I want to merge the two files together (concatenate the array from the existing FTP file with the one from the incoming flowfile) and put the updated file onto the FTP server. The first case (the file does not yet exist) is simple, but how can I go about the second one?
You will probably want to use ListFTP to gather the list of files which already exist, RouteOnAttribute/RouteOnContent to direct flowfiles referencing existing files to a queue, FetchFTP and MergeContent to join the content of the existing file with the new content, and then PutFTP to place the file on the FTP server again. You will need to investigate approaches to identify the filename attribute in the local flowfiles and match it against the remote file names (I'd suggest persisting the local filenames into a cache when you generate them and routing the remote file-listing flowfiles through an enrichment processor). The DistributedMapCache and LookupAttribute processor families will probably be useful here. Abdelkrim Hadjidj has written a great article on using them.
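As a side note, MergeContent on its own concatenates the raw bytes, so joining two complete arrays would yield something like [...][...] rather than a single array; the merge step therefore usually needs a small transform. Here is a minimal sketch of that transform in Python, assuming both payloads are JSON arrays; how it is wired into the flow (e.g. via ExecuteScript or a similar scripting approach) is left open:

import json

def merge_json_arrays(existing_text, incoming_text):
    # Parse both payloads and return one concatenated JSON array.
    existing = json.loads(existing_text)
    incoming = json.loads(incoming_text)
    if not isinstance(existing, list) or not isinstance(incoming, list):
        raise ValueError("both payloads must be JSON arrays")
    return json.dumps(existing + incoming)

# Example: the content that would be written back to the FTP server
print(merge_json_arrays('[{"foo": 1}]', '[{"bar": 2}]'))  # [{"foo": 1}, {"bar": 2}]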
I'm making my first videogame in Unity and I was messing around with storing data in JSON files. More specifically, language localization files. Each file stores key-value pairs of strings. Keys are grouped in categories (like "menu.pause.quit").
Now, since each language file is essentially going to have the same keys (but different values), it would be nice if VS code recognized and helped me write these keys with tooltips.
For example, if you open settings.json in VS Code and try to write something, there's some kind of autocompletion going on:
How does that work?
The autocompletion for JSON files is done with JSON schemas. It is described in the documentation:
https://code.visualstudio.com/docs/languages/json#_json-schemas-and-settings
Basically, you need to create a JSON schema which describes your JSON file, and you can map it to JSON files via your user or workspace settings:
// settings.json
{
    // ... other settings
    "json.schemas": [
        {
            "fileMatch": [
                "/language-file.json"
            ],
            "url": "./language-file-schema.json"
        }
    ]
}
This will provide autocompletion for language-file.json (or any other file matched by the pattern) based on language-file-schema.json (in your workspace folder, but you can also use absolute paths).
The elements of fileMatch may be patterns which will match against multiple files.
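For illustration, a language-file-schema.json for flat key-value string pairs could look roughly like this (the key names beyond "menu.pause.quit" are made up):

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "menu.pause.quit": {
            "type": "string",
            "description": "Label for the Quit button in the pause menu"
        },
        "menu.pause.resume": {
            "type": "string",
            "description": "Label for the Resume button in the pause menu"
        }
    },
    "additionalProperties": false
}

With that mapping in place, VS Code will suggest the declared keys while you type in any matched language file and show each key's description as a tooltip.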
I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each of these files has hardly 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have the following -
folder a - inside this folder, 10000 json files
folder b - inside this folder, 10000 json files
My aim - Target_folder - dev-bucket/final/ - containing all 20000 json files.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        # Build the destination key from the file name only,
        # so every object lands directly under the target folder
        final_file = target_folder + obj.key.split('/')[-1]
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.error("The process has failed to copy files from the source location to the target location: %s", e)
    exit(1)
I was thinking of merging the data into a single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)

response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # list.append mutates in place and returns None
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that the files contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
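If you do decide to merge the JSON yourself in Python, as the question suggests, a minimal sketch building on the paginator code above could look like the following; it assumes every object's body is itself a JSON array, and the bucket/prefix values and the target key 'final/merged.json' are placeholders modelled on the question:

import json
import boto3

s3_client = boto3.client('s3')
s3_bucket = 'dev-bucket'   # values modelled on the question's layout
FOLDER = 'data/'

# Collect the records from every small JSON file under the prefix,
# then write them back to S3 as one combined JSON array.
merged = []
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER):
    for obj in page.get('Contents', []):
        body = s3_client.get_object(Bucket=s3_bucket, Key=obj['Key'])['Body'].read().decode()
        merged.extend(json.loads(body))   # assumes each file holds a JSON array

s3_client.put_object(
    Bucket=s3_bucket,
    Key='final/merged.json',              # placeholder target key
    Body=json.dumps(merged).encode('utf-8'),
)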
I have a directory with CSV files. Each file contains a list of GET requests I'd like to make with JMeter. What I'd like to do is read all the files in the directory and then loop through each CSV to send the requests in JMeter. The number of files isn't consistent, so I don't want to hard-code the file names into CSV samplers.
So, in effect, I'd like to read all the files in the directory and store them in an array variable, then loop through the array and pass each CSV file to the CSV sampler, which will in turn read the CSV file and pass the content to an HTTP Request sampler to send the GET requests.
I created a Beanshell script to read the files in the directory and store them in an array, but when I try to pass this to the CSV config element, I get errors stating the variable doesn't exist.
I've tried another Beanshell script to read the file and pass the lines to an HTTP Request sampler as a variable, but the issue was that it would store all the file contents in memory per thread.
I'd like to know the best approach to read the files, send the requests, and use the response data to generate reports.
You will not be able to populate the CSV Data Set Config using Beanshell, because the CSV Data Set Config is a Configuration Element and, according to the Execution Order chapter of the user manual, Configuration Elements are executed before anything else.
Since JMeter 3.1 you should not be using Beanshell anyway; it is recommended to switch to JSR223 elements and the Groovy language.
I would recommend going for the Directory Listing Config plugin: it scans the provided folder (in your case, the one with the CSV files) and stores the paths of the files it finds in a JMeter variable.
So you can use the Directory Listing Config in combination with the __StringFromFile() or __CSVRead() functions, and that should be a more or less good way of implementing your requirements.
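For example (the variable name here is just an assumption about how the plugin would be configured): if the Directory Listing Config stores each discovered path in a variable called csvFile, an HTTP Request sampler could read values from the current file with something like ${__CSVRead(${csvFile},0)} for a specific column, or ${__StringFromFile(${csvFile})} to pull it in line by line.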
I want to use CouchBase to store lots of data. I have that data in the form:
[
    {
        "foo": "bar1"
    },
    {
        "foo": "bar2"
    },
    {
        "foo": "bar3"
    }
]
I have that in a JSON file that I zipped into data.zip. I then call:
cbdocloader.exe -u Administrator -p **** -b mybucket C:\data.zip
However, this creates a single item in my bucket, not three as I expected. This actually makes sense, as I should be able to store arrays and I did not "tell" CouchBase to expect multiple items instead of one.
The temporary solution I have is to split every item into its own JSON file, then add the lot of them to a single zip file and call cbdocloader again. The problem is that I might have lots of these entries, and creating all the files might take too long. Also, I saw in the doc that cbdocloader uses the filename as a key. That might be problematic in my case...
I obviously missed a step somewhere but couldn't find what in the documentation. How should I format my json file?
You haven't missed any steps. The cbdocloader script is very limited at the moment. Couchbase will be adding a cbimport and cbexport tool in the near future that will allow you to add json files with various formats (including the one you mentioned). In the meantime you will need to use the current workaround you are using to get your data loaded.
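Until those tools are available, the splitting workaround from the question can at least be scripted. A minimal Python sketch (the file names data.json and items.zip and the item_<n> key pattern are assumptions; check which zip layout your cbdocloader version expects):

import json
import zipfile

# Split one JSON array into one small document per element and zip them,
# so cbdocloader loads each element as a separate item.
with open('data.json') as f:
    items = json.load(f)

with zipfile.ZipFile('items.zip', 'w') as zf:
    for i, item in enumerate(items):
        # cbdocloader derives the document key from the file name
        zf.writestr('item_%06d.json' % i, json.dumps(item))

The resulting archive can then be loaded the same way as in the question, e.g. cbdocloader.exe -u Administrator -p **** -b mybucket C:\items.zip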
I need to read a bunch of JSON files from an HDFS directory. After I'm done processing, Spark needs to place the files in a different directory. In the meantime, there may be more files added, so I need a list of files that were read (and processed) by Spark, as I do not want to remove the ones that were not yet processed.
The function read.json converts the files immediately into DataFrames, which is cool but it does not give me the file names like wholeTextFiles. Is there a way to read JSON data while also getting the file names? Is there a conversion from RDD (with JSON data) to DataFrame?
From version 1.6 on, you can use input_file_name() to get the name of the file in which a row is located. Thus, getting the names of all the files can be done via a distinct on that column.
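A short PySpark sketch of that idea (the path, the column name, and the existing SparkSession named spark are placeholders/assumptions):

from pyspark.sql.functions import input_file_name

# Read the JSON files and tag every row with the file it came from
df = spark.read.json("hdfs:///data/incoming/*.json") \
          .withColumn("source_file", input_file_name())

# Distinct list of files that were actually read, e.g. so they can be moved afterwards
processed_files = [row.source_file
                   for row in df.select("source_file").distinct().collect()]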