I am a newbie with the AWS S3 client. I used the "aws s3 cp" command to download a batch of files from S3 to the local file system, and it was pretty fast. But when I read the contents of the same batch of files in a single-threaded loop using the Amazon Java SDK, it was surprisingly several times slower than the "aws s3 cp" command.
Does anyone know the reason? I suspect "aws s3 cp" is multi-threaded.
If you look at the source of transferconfig.py, it indicates that the defaults are:
DEFAULTS = {
    'multipart_threshold': 8 * (1024 ** 2),
    'multipart_chunksize': 8 * (1024 ** 2),
    'max_concurrent_requests': 10,
    'max_queue_size': 1000,
}
which means that it can be doing 10 requests at the same time, and that it also chunks transfers into 8 MB pieces when the file is larger than 8 MB.
This is also documented in the S3 CLI configuration documentation.
These are the configuration values you can set for S3:
max_concurrent_requests - The maximum number of concurrent requests.
max_queue_size - The maximum number of tasks in the task queue.
multipart_threshold - The size threshold the CLI uses for multipart transfers of individual files.
multipart_chunksize - When using multipart transfers, this is the chunk size that the CLI uses for multipart transfers of individual files.
You could tune it down, to see if it compares with your simple method:
aws configure set default.s3.max_concurrent_requests 1
Don't forget to tune it back up afterwards, or else your AWS performance will be miserable.
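The same concurrency can be reproduced in your own code. As a sketch (in Python rather than the Java SDK from the question), the single-threaded loop can be parallelized with a worker pool; `fetch_object` here is a hypothetical stand-in for whatever single-object download call your SDK provides:

```python
from concurrent.futures import ThreadPoolExecutor

# Mirrors the CLI's max_concurrent_requests default of 10.
MAX_CONCURRENT_REQUESTS = 10

def download_all(keys, fetch_object, max_workers=MAX_CONCURRENT_REQUESTS):
    """Download every key using a pool of workers, the way `aws s3 cp`
    issues up to 10 requests at once.

    `fetch_object` takes a single key and returns that object's contents;
    plug in your SDK's single-object GET call here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `keys` in the results.
        return list(pool.map(fetch_object, keys))
```

With a pool of workers, per-request latency overlaps instead of accumulating, which is usually the bulk of the gap you measured against the single-threaded loop.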
I have a fastapi app that before starting takes some data from a minio instance.
This is the main.py:
def get_app() -> FastAPI:
    app = FastAPI(title=APP_NAME,
                  description=APP_DESCRIPTION,
                  version=APP_VERSION,
                  openapi_tags=TAGS_METADATA)
    app.include_router(api_router, prefix=API_PREFIX)
    #logger = CustomizeLogger().make_logger(config_path)
    app.logger = logger
    app.add_event_handler("startup", start_app_handler(app))
    app.add_event_handler("shutdown", stop_app_handler(app))
    return app

app = get_app()
Inside start_app_handler this is what's happening:
client = Minio(
    endpoint=MINIO_HOST,
    access_key=MINIO_ACCESS_KEY,
    secret_key=MINIO_SECRET_KEY,
    http_client=httpclient,
    secure=True
)

# file_names is a list of files stored in a minIO bucket
for file_name in file_names:
    response = self.client.get_object(bucket_name=bucket,
                                      object_name=file_name)
    # Read file, manipulate etc.
Everything is packed as a Python package and installed in a Docker image with the final command:
CMD ["gunicorn","--timeout","900","--log-level","error", "-b", "0.0.0.0:80","--worker-class=uvicorn.workers.UvicornWorker", "--workers=9", "app.main:app"]
The whole API is deployed on a k3s cluster: 2 nodes with 4 CPUs each and 32 GB RAM.
When I use kubectl logs pod_name to see what's going on, it seems that it can't read one particular file (about 300 MB in size).
I tried with only 1 worker (with multiple threads) and everything ran fine, so I guess this is a problem with gunicorn.
Anyone has any hints that could be useful?
When deploying a cloud function, I'm using a command like this:
gcloud functions deploy MyCloudFunction --runtime nodejs8 --trigger-http
The default memory allocation is 256MB. I changed it to 1GB using the Google Cloud console in the browser.
Is there a way to change memory allocation when deploying by gcloud command?
You might want to read over the CLI documentation for gcloud functions deploy.
You can use the --memory flag to set the memory:
gcloud functions deploy MyCloudFunction ... --memory 1024MB
You may also need to increase the CPU count to be able to increase memory beyond 512 MiB. Otherwise, with the default 0.33 vCPU Cloud Function allocation, I saw errors like the following, where [SERVICE] is the name of your Google Cloud Function:
ERROR: (gcloud.functions.deploy) INVALID_ARGUMENT: Could not update Cloud Run service [SERVICE]. spec.template.spec.containers[0].resources.limits.memory: Invalid value specified for container memory. For 0.333 CPU, memory must be between 128Mi and 512Mi inclusive.
From https://cloud.google.com/run/docs/configuring/cpu#command-line, this can be done by calling gcloud run services update [SERVICE] --cpu [CPU], for example:
gcloud run services update [SERVICE] --cpu=4 --memory=16Gi --region=northamerica-northeast1
You should see a response like:
Service [SERVICE] revision [SERVICE-*****-***] has been deployed and is serving 100 percent of traffic.
https://console.cloud.google.com/run can help show what is happening too.
We are building a desktop app, on Electron, to share media on IPFS. We want to incentivize the people who, by an IPFS add or pin, make data available to other users and in effect are "seeding" the data. We want to track how much data is being sent and received by each user, programmatically and periodically.
Is there a standard pattern or a service to be able to do this?
TIA!
On the CLI you can use the ipfs stats bw -p <peer id> command to see the total bytes sent and received between your node and the peer ID you pass in.
$ ipfs stats bw -p QmeMKDA6HbDD8Bwb4WoAQ7s9oKZTBpy55YFKG1RSHnBz6a
Bandwidth
TotalIn: 875 B
TotalOut: 14 kB
RateIn: 0 B/s
RateOut: 0 B/s
See: https://docs.ipfs.io/reference/api/cli/#ipfs-stats-bw
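If you are polling the CLI periodically from a script rather than using the JS client described below, the text output above is easy to parse. A minimal sketch (the unit suffixes are kept as strings rather than normalized to bytes):

```python
def parse_bw_stats(output: str) -> dict:
    """Parse the text printed by `ipfs stats bw` into a dict,
    e.g. {'TotalIn': '875 B', 'TotalOut': '14 kB', ...}.

    Lines without a colon (like the 'Bandwidth' header) are skipped.
    """
    stats = {}
    for line in output.splitlines():
        if ':' in line:
            key, value = line.split(':', 1)
            stats[key.strip()] = value.strip()
    return stats
```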
You can use the ipfs.stats.bw method to get the data programmatically from the JS implementation of IPFS (js-ipfs), or via js-ipfs-http-client talking to the HTTP API of a locally running ipfs daemon.
ipfs.stats.bw will show all traffic between two peers, which can include DHT queries and other traffic that isn't directly related to sharing blocks of data.
If you want info on just the blocks of data shared, then you can use ipfs bitswap ledger from the command line.
$ ipfs bitswap ledger QmeMKDA6HbDD8Bwb4WoAQ7s9oKZTBpy55YFKG1RSHnBz6a
Ledger for QmeMKDA6HbDD8Bwb4WoAQ7s9oKZTBpy55YFKG1RSHnBz6a
Debt ratio: 0.000000
Exchanges: 0
Bytes sent: 0
Bytes received: 0
See: https://docs.ipfs.io/reference/api/cli/#ipfs-bitswap-ledger
That API is not directly available in js-ipfs or js-ipfs-http-client yet.
I've been looking at my Elasticsearch logs, and I came across this error:
rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#6d32fa18
After looking up the error, the general consensus was to increase the size of the queue, as discussed here - https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
The question I have is: how do I actually do this? Is there a configuration file somewhere that I am missing?
From Elasticsearch 5 onward you cannot use the API to update the thread pool search queue size. It is now a node-level setting. See this:
Thread pool settings are now node-level settings. As such, it is not possible to update thread pool settings via the cluster settings API.
To update the thread pool you have to add thread_pool.search.queue_size: <new size> in the elasticsearch.yml file of each node and then restart Elasticsearch.
To change the queue size one could add it in the config file for each of the nodes as follows:
threadpool.search.queue_size: <new queue size>
However, this would also require a cluster restart.
Up to Elasticsearch 2.x you could update this via the cluster settings API, which did not require a cluster restart; however, this option is gone in Elasticsearch 5.x and newer:
curl -XPUT _cluster/settings -d '{
  "persistent" : {
    "threadpool.search.queue_size" : <new_size>
  }
}'
You can query the queue size as follows (quote the URL so the shell does not treat & as a background operator):
curl '<server>/_cat/thread_pool?v&h=search.queueSize'
As of Elasticsearch 6 the type of the search thread pool has changed to fixed_auto_queue_size, which means setting threadpool.search.queue_size in elasticsearch.yml is not enough, you have to control the min_queue_size and max_queue_size parameters as well, like this:
thread_pool.search.queue_size: <new_size>
thread_pool.search.min_queue_size: <new_size>
thread_pool.search.max_queue_size: <new_size>
I recommend using _cluster/settings?include_defaults=true to view the current thread pool settings in your nodes before making any changes. For more information about the fixed_auto_queue_size thread pool read the docs.
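If you want to check queue sizes from a script rather than the command line, one option is to fetch the _cat/thread_pool table and parse it. This is a sketch assuming a node reachable at http://localhost:9200 with no authentication; name and queue_size are documented columns of the _cat/thread_pool API:

```python
import urllib.request

def parse_thread_pool(cat_output: str) -> dict:
    """Parse `_cat/thread_pool?h=name,queue_size` output into {pool: size}.

    queue_size can be -1 for unbounded pools, so allow a leading minus sign.
    """
    pools = {}
    for line in cat_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip('-').isdigit():
            pools[parts[0]] = int(parts[1])
    return pools

def fetch_thread_pools(base_url="http://localhost:9200"):
    """Fetch and parse the thread pool table from a node."""
    url = base_url + "/_cat/thread_pool?h=name,queue_size"
    with urllib.request.urlopen(url) as resp:
        return parse_thread_pool(resp.read().decode())
```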
I have a Google Compute Engine instance (CentOS) which I could access using its external IP address until recently.
Now, suddenly, the instance cannot be accessed using its external IP address.
I logged in to the developer console and tried rebooting the instance but that did not help.
I also noticed that the CPU usage is almost at 100% continuously.
On further analysis of the serial port output, it appears the init module is not loading properly.
I am pasting below the last few lines from the serial port output of the virtual machine.
rtc_cmos 00:01: RTC can wake from S4
rtc_cmos 00:01: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one day, 114 bytes nvram
cpuidle: using governor ladder
cpuidle: using governor menu
EFI Variables Facility v0.08 2004-May-17
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
usbhid: v2.6:USB HID core driver
GRE over IPv4 demultiplexor driver
TCP cubic registered
Initializing XFRM netlink socket
NET: Registered protocol family 17
registered taskstats version 1
rtc_cmos 00:01: setting system clock to 2014-07-04 07:40:53 UTC (1404459653)
Initalizing network drop monitor service
Freeing unused kernel memory: 1280k freed
Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 800k freed
Freeing unused kernel memory: 1584k freed
Failed to execute /init
Kernel panic - not syncing: No init found. Try passing init= option to kernel.
Pid: 1, comm: swapper Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Call Trace:
[] ? panic+0xa7/0x16f
[] ? init_post+0xa8/0x100
[] ? kernel_init+0x2e6/0x2f7
[] ? child_rip+0xa/0x20
[] ? kernel_init+0x0/0x2f7
[] ? child_rip+0x0/0x20
Thanks in advance for any tips to resolve this issue.
Mathew
It looks like you might have a script or other program that is causing you to run out of inodes.
You can delete the instance without deleting the persistent disk (PD) and create a new VM with a higher capacity using your PD; however, if it's a script causing this, you will end up with the same issue. It's always recommended to back up your PD before making any changes.
Run this command to find more info about your instance:
gcutil --project=<project-id> getserialportoutput <instance-name>
If the issue still continues, you can either:
- Make a snapshot of your PD and work on a copy of your PD, or
- Delete the instance without deleting the PD
Then attach and mount the PD to another VM as a second disk, so you can access it to find what is causing this issue. Visit this link https://developers.google.com/compute/docs/disks#attach_disk for more information on how to do this.
Visit this page http://www.ivankuznetsov.com/2010/02/no-space-left-on-device-running-out-of-inodes.html for more information about inodes troubleshooting.
Make sure the Allow HTTP traffic setting on the vm is still enabled.
Then check which network firewall you are using and its rules.
If your network is set up to use an ephemeral IP, it will be periodically released back. This will cause your IP to change over time. Set it to static/reserved instead (on the Networks page).
https://developers.google.com/compute/docs/instances-and-network#externaladdresses