Pyspark read all JSON files from a subdirectory of S3 bucket

I am trying to read JSON files from a subdirectory called world of an S3 bucket named hello. When I list all the objects of that directory using boto3, I can see several part files (which were possibly created by a Spark job) like below.
world/
world/_SUCCESS
world/part-r-00000-....json
world/part-r-00001-....json
world/part-r-00002-....json
world/part-r-00003-....json
world/part-r-00004-....json
world/part-r-00005-....json
world/part-r-00006-....json
world/part-r-00007-....json
I have written the following code to read all these files.
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_session = SparkSession \
    .builder \
    .config(
        conf=SparkConf().setAll(spark_config).setAppName(app_name)
    ).getOrCreate()
hadoop_conf = spark_session._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "my-aws-access-key")
hadoop_conf.set("fs.s3a.secret.key", "my-aws-secret-key")
hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")
df = spark_session.read.json("s3a://hello/world/")
and am getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o98.json.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: , AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID:
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:557)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:392)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
I have tried with "s3a://hello/world/*" and "s3a://hello/world/*.json" as well, but I still get the same error.
FYI, I am using the following versions of the tools:
pyspark 2.4.5
com.amazonaws:aws-java-sdk:1.7.4
org.apache.hadoop:hadoop-aws:2.7.1
org.apache.hadoop:hadoop-common:2.7.1
Can anyone help me with this?

It seems that the credentials you are using to access the bucket/folder don't have the required access rights.
Please check the following:
Credentials or role specified in your application code
Policy attached to the Amazon Elastic Compute Cloud (Amazon EC2) instance profile role
Amazon S3 VPC endpoint policy
Amazon S3 source and destination bucket policies
A few things you can do to debug quickly:
On the master node of the cluster, try to access the bucket using
aws s3 ls s3://hello/world/
If this throws an error, try to resolve the access issue by following this link:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-403-access-denied/
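A quick way to check from Python whether the same access/secret key pair passed to the s3a configuration can actually list and read that prefix is a small boto3 script. This is only a minimal sketch; the key values, bucket and prefix below are the placeholders from the question.
import boto3
# Use exactly the same credentials that are set on fs.s3a.access.key / fs.s3a.secret.key
s3 = boto3.client(
    "s3",
    aws_access_key_id="my-aws-access-key",
    aws_secret_access_key="my-aws-secret-key",
)
# ListBucket permission is needed so Spark can enumerate the directory ...
response = s3.list_objects_v2(Bucket="hello", Prefix="world/")
for obj in response.get("Contents", []):
    print(obj["Key"])
# ... and GetObject permission is needed to read the individual part files.
s3.head_object(Bucket="hello", Key="world/_SUCCESS")
If this script also returns a 403, the problem is purely on the IAM/bucket-policy side; if it succeeds while the Spark job still fails, the issue is more likely in the s3a configuration itself (for example the endpoint or V4-signing settings).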

Related

How to copy an existing file from an AWS S3 bucket which is private

I have used the copy function for copying the file. My code works without using an S3 bucket, both locally and on IIS, but when I use an S3 bucket I get this error.
Error executing "PutObject" on
"{bucket_path}/local-iportalDocuments/96/building.jpeg"; AWS HTTP
error: Client error: PUT {{bucket_path}}/local-iportalDocuments/96/building.jpeg resulted in a
403 Forbidden response:\n\nAccessDeniedAccess
Denied3BGXVF (truncated...)\n AccessDenied
(client): Access Denied - \nAccessDeniedAccess
Denied3BGXVF8J5RST47DW9tUMIzQGZyiKdb4+Vwj10EWFxUiMYzmdaCCNVVfzuSRzAhj4YvVssE0+OA12IeW3WTu2K+POr0s=",
"exception": "Aws\S3\Exception\S3Exception
I use these lines of code for the copy:
$mediaItem = $oldModel->getMedia('document_files')->find($file['id']);
$copiedMediaItem = $mediaItem->copy($model, 'document_files', config('app.media_disk'));
Need help.

Getting OAuth Refresh Token Error UNAUTHORISED ERROR when hitting Bigtable

I have an app that uses the bigtable-hbase API for creating a Bigtable connection using a service account file.
This works fine locally, and sometimes on the WebLogic server as well.
But after some requests through the API, I am getting the following error:
io.grpc.StatusRuntimeException: UNAUTHENTICATED: Unexpected failure get auth token
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.auth.oauth2.ServiceAccountCredentials.refreshAccessToken(ServiceAccountCredentials.java:365)
at com.google.cloud.bigtable.grpc.io.RefreshingOAuth2CredentialsInterceptor.refreshCredentials(RefreshingOAuth2CredentialsInterceptor.java:379)
at com.google.cloud.bigtable.grpc.io.RefreshingOAuth2CredentialsInterceptor.access$100(RefreshingOAuth2CredentialsInterceptor.java:60)
at com.google.cloud.bigtable.grpc.io.RefreshingOAuth2CredentialsInterceptor$2.call(RefreshingOAuth2CredentialsInterceptor.java:328)
at com.google.cloud.bigtable.grpc.io.RefreshingOAuth2CredentialsInterceptor$2.call(RefreshingOAuth2CredentialsInterceptor.java:325)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am using the following configuration to create the connection:
Configuration config = BigtableConfiguration.configure(projectId,instanceId);
config.set(BigtableOptionsFactory.BIGTABLE_SERVICE_ACCOUNT_JSON_KEYFILE_LOCATION_KEY, new File(filePath).toString());
Connection btConnection = ConnectionFactory.createConnection(config);
// then the code to read and write into the table ...
This code works fine sometimes, and after some requests it throws the above error.
I need to understand why this is happening and how I can resolve it so that the application works correctly when we send bulk requests.
I think the issue might be with refreshing the access token, but how can I solve it?

How to upload a file to OCI Object storage

I am trying to use the UploadObjectExample.java code to upload a file to OCI Object Storage. I am running into a connection timeout error while connecting to the Object Storage URL. The same config file is used by the OCI CLI to successfully upload files to OCI Object Storage.
Here is the error log:
Exception in thread "main" com.oracle.bmc.model.BmcException: (-1, null, true) Timed out while communicating to: https://objectstorage.us-ashburn-1.oraclecloud.com (outbound opc-request-id: 1EB5AA4A7FD64D58A54F876AD0C9E83B)
at com.oracle.bmc.http.internal.RestClient.convertToBmcException(RestClient.java:572)
at com.oracle.bmc.http.internal.RestClient.put(RestClient.java:380)
at com.oracle.bmc.objectstorage.ObjectStorageClient.putObject(ObjectStorageClient.java:1053)
at com.oracle.bmc.objectstorage.transfer.internal.SimpleRetry$1.apply(SimpleRetry.java:34)
at com.oracle.bmc.objectstorage.transfer.internal.SimpleRetry$1.apply(SimpleRetry.java:26)
at com.oracle.bmc.objectstorage.transfer.UploadManager.singleUpload(UploadManager.java:111)
at com.oracle.bmc.objectstorage.transfer.UploadManager.upload(UploadManager.java:73)
at UploadObjectExample.main(UploadObjectExample.java:74)
Caused by: javax.ws.rs.ProcessingException: java.net.SocketTimeoutException: connect timed out
at org.glassfish.jersey.client.internal.HttpUrlConnector.apply(HttpUrlConnector.java:284)
at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:278)
at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$0(JerseyInvocation.java:753)
at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:752)
at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:445)
at org.glassfish.jersey.client.JerseyInvocation$Builder.put(JerseyInvocation.java:334)
at com.oracle.bmc.http.internal.ForwardingInvocationBuilder.put(ForwardingInvocationBuilder.java:141)
at com.oracle.bmc.http.internal.RestClient.put(RestClient.java:377)
Please test curl -v https://objectstorage.us-ashburn-1.oraclecloud.com from the same machine where the Java client times out, just to make sure there are no connection issues. If it works fine, you may try to change the timeout value in ClientConfiguration. You can see more details here: https://github.com/oracle/oci-java-sdk/issues/92
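If curl is not available on that machine, a rough Python equivalent of the same connectivity test (just a plain TCP connection to port 443, not an OCI SDK call) is:
import socket
# Basic reachability check for the Object Storage endpoint; raises an exception on timeout or refusal.
socket.create_connection(("objectstorage.us-ashburn-1.oraclecloud.com", 443), timeout=10).close()
This only tells you whether the endpoint is reachable at the network level; the SDK-side timeout still has to be adjusted in ClientConfiguration as described above.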
Before creating a support ticket you might also try to create a new issue on github/oci-java-sdk.
Without knowing more about the config file (I do not suggest you post it here), your home region, and other elements, it is very hard to help.
I would suggest you open a support ticket at https://support.oracle.com, making sure that you select the Cloud tab and the Service as "Oracle Cloud Infrastructure".
Are you using a proxy? If so, you may need to use the OCI Java SDK ApacheConnector.
This was an issue with the proxy. This was resolved by using the ash7 proxy.

Connecting to CloudSQL MySQL over SSL from an external application

I am trying to get a sample Java application to connect to a MySQL Gen 2 instance I have in GCP. I use SSL, and the IP address is whitelisted. I have confirmed connectivity to the instance using the MySQL command line and passing in the client-cert.pem, client-key.pem and the server-ca.pem. Now, in order to connect to it from the Spring Boot Java application, I did the following:
created a p12 file from the client cert and key and added it to keystore.jks
created a truststore with the server-ca.pem file.
Added this code in the main before the connection is created:
System.setProperty("javax.net.debug", "all");
System.setProperty("javax.net.ssl.trustStore", TRUST_STORE_PATH);
System.setProperty("javax.net.ssl.trustStorePassword", "fake_password");
System.setProperty("javax.net.ssl.keyStore", KEY_STORE_PATH);
System.setProperty("javax.net.ssl.keyStorePassword", "fake_password");
For the JDBC URL, I used: jdbc:mysql://1.1.1.1:3306/sampledb?useSSL=true&requireSSL=true
However, I am unable to connect to the instance and see this error in the Java SSL debug output:
restartedMain, RECV TLSv1.1 ALERT: fatal, unknown_ca
%% Invalidated: [Session-2, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]
restartedMain, called closeSocket()
restartedMain, handling exception: javax.net.ssl.SSLHandshakeException: Received fatal alert: unknown_ca
restartedMain, called close()
restartedMain, called closeInternal(true)
I also tried to run
openssl verify -CAfile server-ca.pem client-cert.pem
and got this output:
error 20 at 0 depth lookup:unable to get local issuer certificate
Any ideas on what I might be doing wrong?

Cloudbees AWS Elastic Beanstalk deployment - application not found error

I am trying to deploy an application from a Jenkins build on Dev#cloud to AWS using the instructions given at
https://developer.cloudbees.com/bin/view/DEV/ElasticBeanstalk
However, I am stuck because "cloudbees-deployer:elastic-beanstalk" is not able to locate my application at AWS.
Here is the console output from the Jenkins build:
[cloudbees-deployer:elastic-beanstalk] Checking if S3 bucket
'photoid-reports-aws' exists...
[cloudbees-deployer:elastic-beanstalk] Checking if S3 bucket
'photoid-reports-aws' location...
[cloudbees-deployer:elastic-beanstalk] S3 bucket 'photoid-reports-aws'
location matches: us-east-1
[cloudbees-deployer:elastic-beanstalk] Uploading application to S3
bucket 'photoid-reports-aws/jenkins-photoid-reports-aws-9'...
[cloudbees-deployer:elastic-beanstalk] Application uploaded to S3
bucket 'photoid-reports-aws' with key
'jenkins-photoid-reports-aws-9/deploytest', version id 'null' and eTag
'427d78c1e5bfbaa7a1d10f46280236cc-8'
[cloudbees-deployer:elastic-beanstalk] Checking if application version
'prod-build' exists...
[cloudbees-deployer:elastic-beanstalk] Creating application version
'prod-build'...
com.cloudbees.plugins.deployer.exceptions.DeployException: No
Application named 'deploytest' found. (Service: AWSElasticBeanstalk;
Status Code: 400; Error Code: InvalidParameterValue; Request ID:
0cc70036-470e-11e4-90e5-1717b7862a74)
at com.cloudbees.plugins.deployer.engines.Engine.process(Engine.java:185)
at com.cloudbees.plugins.deployer.engines.Engine.perform(Engine.java:119)
at com.cloudbees.plugins.deployer.DeployBuilder.perform(DeployBuilder.java:104)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:825)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:160)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:606)
at hudson.model.Run.execute(Run.java:1684)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:232)
Caused by: com.amazonaws.AmazonServiceException: No Application named
'deploytest' found. (Service: AWSElasticBeanstalk; Status Code: 400;
Error Code: InvalidParameterValue; Request ID:
0cc70036-470e-11e4-90e5-1717b7862a74)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:820)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:439)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:245)
at com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalkClient.invoke(AWSElasticBeanstalkClient.java:1679)
at com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalkClient.createApplicationVersion(AWSElasticBeanstalkClient.java:540)
at com.cloudbees.plugins.deployer.impl.amazon.EngineImpl$DeployFileCallable.invoke(EngineImpl.java:355)
at com.cloudbees.plugins.deployer.impl.amazon.EngineImpl$DeployFileCallable.invoke(EngineImpl.java:224)
at com.cloudbees.plugins.deployer.engines.Engine$FingerprintingWrapper.invoke(Engine.java:271)
at com.cloudbees.plugins.deployer.engines.Engine$FingerprintingWrapper.invoke(Engine.java:259)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2462)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:328)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Build step 'Deploy applications' marked build as failure
Finished: FAILURE
Interesting. It looks like CloudBees is assuming that you already have an application named "deploytest". The log shows it is only trying to create a new application version, as you can see after the S3 upload succeeded: it checks that the application version doesn't exist and then tries to create it.
What happens if you go through the Elastic Beanstalk console to set up a new application with the name 'deploytest'? Just select the desired Environment Tier, Platform, and then Environment Type and try running that again. When it asks for the application version, you can just use the sample app, which should be selected by default.
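If you would rather create the application from a script than through the console, a minimal boto3 sketch is below (this assumes AWS credentials for the target account are configured locally; the application name and region are taken from the error message and build log):
import boto3
# Create the Elastic Beanstalk application that the CloudBees deployer expects to already exist.
eb = boto3.client("elasticbeanstalk", region_name="us-east-1")
eb.create_application(
    ApplicationName="deploytest",
    Description="Created so the Jenkins deploy step can register application versions against it",
)
Once the application exists, the deployer should be able to create the 'prod-build' application version against it.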
Let me know if that helps.