Data Connection - downloading files from multiple URLs in one sync - palantir-foundry

How can I use Data Connection to download files from a large list of URLs as part of a single sync?
I want to be able to do this in parallel, since I'll be downloading 1–2,000 new files per day. I also want all the files to be stored in a single dataset.

This is supported using the magritte-rest-plugin. Configure the source as a map of magritte-rest sources. The format of the source is:
type: magritte-rest-v2
sourceMap:
  source_name_1:
    source-object
  source_name_2:
    source-object2
which, when filled out, might look like:
type: magritte-rest-v2
sourceMap:
  my_api:
    type: magritte-rest
    authToken: "{{token}}"
    url: "https://some-api.com/"
  another_api:
    type: magritte-rest
    authToken: "{{other_token}}"
    url: "https://some-other_api.com/"

Related

How to upload csv data that contains newline with dbt

I have a 3rd party generated CSV file that I wish to upload to Google BigQuery using dbt seed.
I manage to upload it manually to BigQuery, but I need to enable "Quoted newlines" which is off by default.
When I run dbt seed, I get the following error:
16:34:43 Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43 Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.
There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
Any ideas?
As far as I know, the seed feature is quite limited by what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for expanding the seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.
That said, what has worked very well for me is to store flat files in a GCS bucket within the GCP project and then use the dbt-external-tables package for very similar but much more robust file structuring. Managing this can be a lot of overhead, I know, but it becomes well worth it if your seed files keep growing in a way that can take advantage of partitioning, for instance.
And more importantly, as mentioned in this answer from Jeremy on Stack Overflow,
The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.
For your case, that should be either the quote or allowQuotedNewlines option. If you do choose to use dbt-external-tables, your source.yml for this would look something like:
gcs.yml
version: 2
sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
    tables:
      - name: task
        description: "External table of Snowplow events, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true
Or something very similar.
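One note if you do take this path: as far as I know, dbt-external-tables builds or refreshes the external table via its stage_external_sources run-operation (dbt run-operation stage_external_sources) rather than during a normal dbt run, so that step usually needs to happen before you can select from the source.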
And if you end up taking this path and storing task data in daily partitions like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, which Jeremy also shared with me here:
Retrieve "filename" from gcp storage during dbt-external-tables sideload?
I find it to be a very powerful set of tools for working with files, specifically with BigQuery.

Copying multiple files from one folder to another in the same S3 bucket

I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has hardly 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have following -
folder a - inside this folder, 10000 json files
folder b - inside this folder, 10000 json files
My aim - Target_folder - dev-bucket/final/ with all 20000 json files in it.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.print("The process has failed to copy files from sftp location to base location", e)
    exit(1)
I was thinking of merging the data into one single json file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # note: list.append returns None, so don't reassign response
Can you please guide me? Many thanks in advance.
If you only need to copy the files one time, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
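With the paths from the question, that would look something like the command below. Note that, as far as I know, --recursive mirrors the key structure under the source prefix, so the files would land under final/a/ and final/b/ rather than flat in final/:
aws s3 cp s3://dev-bucket/data/ s3://dev-bucket/final/ --recursive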
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that the objects contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
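If you do stay with a copy-one-object-at-a-time approach, most of the time in the original loop is spent in sequential API calls, so running the copies in parallel usually helps a lot. Here is a minimal sketch using boto3 and a thread pool; the bucket name and prefixes are taken from the question, and it flattens each key to its basename, which assumes the file names are unique across the source folders:
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

# The low-level client is documented as thread-safe, unlike resource objects.
s3_client = boto3.client('s3')

S3_BUCKET = 'dev-bucket'     # adjust as needed
SOURCE_PREFIX = 'data/'
TARGET_PREFIX = 'final/'

def copy_one(key):
    # Flatten: keep only the file name, dropping the a/ or b/ folder part.
    target_key = TARGET_PREFIX + os.path.basename(key)
    s3_client.copy({'Bucket': S3_BUCKET, 'Key': key}, S3_BUCKET, target_key)

# List every object under the source prefix (paginated, as in the question's code).
paginator = s3_client.get_paginator('list_objects_v2')
keys = [
    obj['Key']
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=SOURCE_PREFIX)
    for obj in page.get('Contents', [])
    if not obj['Key'].endswith('/')  # skip "folder" placeholder objects
]

# Run the copies in parallel; tune max_workers to what your account tolerates.
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(copy_one, keys))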

What is an alternative to CSV data set config in JMeter?

We want to use 100 credentials from a .csv file, but I would like to know whether there is any other alternative to this available in JMeter.
If you have the credentials in the CSV file there are no better ways of "feeding" them to JMeter than CSV Data Set Config.
Just in case if you're still looking for alternatives:
__CSVRead() function. The disadvantage is that the function reads the whole file into memory, which might be a problem for large CSV files. The advantage is that you can choose/change the name of the CSV file dynamically (at runtime), while with the CSV Data Set Config the file name is immutable and cannot be changed once it's initialized.
JDBC Test Elements - allows fetching data (i.e. credentials) from the database rather than from file
Redis Data Set - allows fetching data from Redis data storage
HTTP Simple Table Server - exposes simple HTTP API for fetching data from CSV (useful for distributed architecture when you want to ensure that different JMeter slaves will use the different data), this way you don't have to copy .csv file to slave machines and split it
There are a few alternatives:
JMeter plugin for reading random CSV data: Random CSV Data Set Config
JMeter function: __CSVRead
Reading CSV file data from a JSR223 PreProcessor
CSV Data Set Config is simple, easy to use, and available out of the box.
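For reference, if you try the __CSVRead function, the usage in a sampler looks something like the lines below (the file name and column indexes are placeholders); the ,next call advances to the next row after the last column has been read:
username: ${__CSVRead(credentials.csv,0)}
password: ${__CSVRead(credentials.csv,1)}
${__CSVRead(credentials.csv,next)}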

How do I populate MongoDB with my CDN Links?

I have a JSON file, and each entry contains an audio file.
[
  {
    audio: '~/audios/1.mp3',
    info: 'some other info'
  },
  {
    audio: '~/audios/2.mp3',
    info: 'some other info'
  },
  {
    audio: '~/audios/3.mp3',
    info: 'some other info'
  }
]
Now I would like to put all of this stuff in my MongoDB database (instead of using this JSON). In the very end my app will be using some service to store the mp3 files on some super-efficient server I guess, so I would need to save their proper links in my MongoDB. So I guess I will have links like https://cdnjs.cloudflare.com/bla/data/audio1.mp3 (for example) - But how do I generate these links and pop them into my MongoDB database?
I'm not sure if I understand your question. Just upload your audio to your CDN; it should generate the links for you. You can save these links to MongoDB by interfacing directly with the Mongo shell or by using the Mongoose ORM.
If users are going to be uploading music to your app directly, you will probably use some external API to upload the files. For example, if you wanted to upload images to the Imgur API, you would send data to their API endpoints for image uploads, and their API would automatically return a link to your image. You would then write a callback that checks whether the upload succeeded; if all went well, the callback would create a new document in MongoDB/Mongoose to save that link, following a schema that makes it possible to retrieve the location and the uploader (also saving a reference to the user who uploaded it, for example).
You would also probably be using an HTML file input to handle this, if it's a web app.
Alternatively, you can set up your own methods on your back-end server for storing and retrieving file uploads; hosting on Amazon will give you a lot of bandwidth to work with.
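The "save the returned link" step itself is small in any driver. As a rough illustration of the same idea in Python with pymongo (the answer above assumes Mongoose/Node, and the collection name, connection string, and CDN response shape here are hypothetical):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["myapp"]

def save_audio_link(cdn_response, uploader_id, info):
    # cdn_response is whatever your CDN/upload API returned,
    # e.g. {"url": "https://cdn.example.com/audio1.mp3"}
    doc = {
        "audio": cdn_response["url"],
        "info": info,
        "uploader": uploader_id,  # reference to the user who uploaded it
    }
    return db.audios.insert_one(doc).inserted_id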

I am getting an error using LOAD in neo4j. What's my error?

I am trying to load data into Neo4j using a local CSV on my system:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/Users/jlujan/Desktop/northwind/customers.csv" AS row
CREATE (:Customer {CustomerID: row.CustomerID, CompanyName: row.CompanyName,
ContactName: row.ContactName, ContactTitle: row.ContactTitle,
Address: row.Address, City: row.City, Region: row.Region, PostalCode: row.PostalCode, Country: row.Country, Phone: row.Phone, Fax: row.Fax});
Every time I get this error: Couldn't load the external resource at: file:/var/lib/neo4j/import/Users/jlujan/Desktop/northwind/customers.csv
I think it's a URL issue, but I'm not exactly sure what. Please help!
Looks like you're using Neo4j 3?
You'll find a setting in neo4j.conf like this:
# This setting constrains all `LOAD CSV` import files to be under the `import` directory. Remove or comment it out to
# allow files to be loaded from anywhere in the filesystem; this introduces possible security problems. See the `LOAD CSV`
# section of the manual for details.
dbms.directories.import=import
If you remove or comment out this setting, Neo4j should allow loading files from anywhere on the filesystem.
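In other words, after editing, that line would simply look like the one below (Neo4j needs a restart to pick up the change):
# dbms.directories.import=import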