Google Cloud Composer DAG relative directory in GCS bucket [duplicate]

This question already has answers here:
How can I download and access files using Cloud Composer?
(2 answers)
Closed 4 years ago.
What is the proper way to access the root of the Composer instance's GCS bucket, or any other Airflow folder (like /data), to save a task's output file from a simple DAG:
import logging
from os import path
from datetime import datetime

import numpy as np
import pandas as pd

from airflow import models
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def write_to_file():
    df = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'])
    logging.info("Saving results")
    file_path = path.join("output.csv")
    df.to_csv(path_or_buf=file_path, index=False)


with models.DAG(dag_id='write_to_file',
                schedule_interval='*/10 * * * *',
                default_args={'depends_on_past': False,
                              'start_date': datetime(2018, 9, 8)}) as dag:
    t_start = DummyOperator(task_id='start')
    t_write = PythonOperator(
        task_id='write',
        python_callable=write_to_file
    )
    t_end = DummyOperator(task_id='end')

    t_start >> t_write >> t_end
Is there an environment variable set for this, or should I use the GCS hook?

I got an answer on the Composer mailing list: "if you save operator output data to /home/airflow/gcs/data, it will be auto-synced to gs://{composer-bucket}/data".
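Building on that answer, here is a minimal sketch of the task callable writing into the locally mounted data folder; the output file name is illustrative, everything else comes from the DAG above:

import logging
from os import path

import numpy as np
import pandas as pd


def write_to_file():
    df = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'])
    logging.info("Saving results")
    # /home/airflow/gcs/data is mounted on the workers and auto-synced
    # to gs://{composer-bucket}/data, per the mailing-list answer.
    file_path = path.join('/home/airflow/gcs/data', 'output.csv')
    df.to_csv(path_or_buf=file_path, index=False)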

Related

Create dynamic frame from S3 bucket AWS Glue

Summary:
I've got an S3 bucket that contains a list of JSON files. The bucket contains child folders created by date. All the files share a similar structure, and new files are added on a daily basis.
JSON Schema
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("main_data", StructType([
        StructField("action", StringType()),
        StructField("parameters", StructType([
            StructField("project_id", StringType()),
            StructField("integration_id", StringType()),
            StructField("cohort_name", StringType()),
            StructField("cohort_id", StringType()),
            StructField("cohort_description", StringType()),
            StructField("session_id", StringType()),
            StructField("users", StructType([
                StructField("user_id", StringType())
            ]))
        ]))
    ])),
    StructField("lambda_data", StructType([
        StructField("date", LongType())
    ]))
])
Question
I am trying to create a dynamic frame from options where the source is S3 and the format is JSON. I'm using the following code, but it is not returning any values. Where am I going wrong?
Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from functools import reduce
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': 'Location for S3 folder'},
    format='json',
    # formatOptions=$..*
)

print('Total Count:')
df.count()
Can you check whether your Glue role has access to the S3 bucket?
Also, in connection_options add
"recurse": True

Error loading a delimited file into MySQL using Airflow (error code 2068)

I have Airflow installed on Ubuntu under WSL on Windows.
I am trying to load a delimited file stored on my C drive into a MySQL database using the code below:
import datetime
import logging
import os
import csv

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.hooks.mysql_hook import MySqlHook


def bulk_load_sql(table_name, **kwargs):
    local_filepath = 'some c drive path'
    conn = MySqlHook(conn_name_attr='mysql_default')
    conn.bulk_load(table_name, local_filepath)
    return table_name


dag = DAG(
    "dag_name",
    start_date=datetime.datetime.now() - datetime.timedelta(days=1),
    schedule_interval=None)

t1 = PythonOperator(
    task_id='csv_to_stgtbl',
    provide_context=True,
    python_callable=bulk_load_sql,
    op_kwargs={'table_name': 'mysqltablnm'},
    dag=dag
)
It gives the following exception:
MySQLdb._exceptions.OperationalError: (2068, 'LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.')
I have checked the following setting on MySQL and it is ON:
SHOW GLOBAL VARIABLES LIKE 'local_infile'
Could someone please provide some pointers on how to fix it?
Is there any other way I can load a delimited file into MySQL using Airflow?
For now, I have implemented a workaround as follows:

def load_staging():
    mysqlHook = MySqlHook(conn_name_attr='mysql_default')
    conn = mysqlHook.get_conn()
    cursor = conn.cursor()

    csv_data = csv.reader(open('c drive file path'))
    header = next(csv_data)  # skip the header row

    logging.info('Importing the CSV Files')
    for row in csv_data:
        cursor.execute("INSERT INTO table_name (col1, col2, col3) VALUES (%s, %s, %s)",
                       row)

    conn.commit()
    cursor.close()


t1 = PythonOperator(
    task_id='csv_to_stgtbl',
    python_callable=load_staging,
    dag=dag
)

However, it would have been great if LOAD DATA LOCAL INFILE had worked.
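A possible route back to the original approach, as a hedged sketch: error 2068 usually means the client side of the connection has not enabled LOCAL INFILE, even when the server's local_infile variable is ON. Recent Airflow MySQL hooks read a local_infile flag from the connection's Extra field, so setting the Airflow connection extra to {"local_infile": true} and keeping the original bulk_load call may be enough. The connection id below is the question's mysql_default.

# Hedged sketch: assumes the Airflow connection 'mysql_default' carries
# Extra = {"local_infile": true} so the MySQL client allows LOCAL INFILE.
from airflow.hooks.mysql_hook import MySqlHook

def bulk_load_sql(table_name, **kwargs):
    local_filepath = 'some c drive path'  # placeholder path from the question
    hook = MySqlHook(mysql_conn_id='mysql_default')
    hook.bulk_load(table_name, local_filepath)  # issues LOAD DATA LOCAL INFILE
    return table_name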

MySQL export to GCP Cloud Storage

I have MySQL running on-prem and would like to migrate it to MySQL running on Cloud SQL (GCP). I first want to export the tables to Cloud Storage as JSON files and then move them from there into MySQL (Cloud SQL) and BigQuery.
Now I wonder how I should do this: export each table as JSON, or just dump the whole database to Cloud Storage? (We might need to change the schemas for some tables, which is why I'm thinking of doing it table by table.)
Is there any way of doing it with Python pandas?
I found this --> Pandas Dataframe to Cloud Storage Bucket
but I don't understand how to connect this to my GCP Cloud Storage, or how to run mycursor.execute("SELECT * FROM table") for all my tables.
EDIT 1:
So I came up with this, but it only works for the selected schema + table. How can I do this for all tables in the schema?
#!/usr/bin/env python3

import mysql.connector
import pandas as pd
from google.cloud import storage
from google.oauth2 import service_account
import os
import csv

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/python2/key.json"
# export GOOGLE_APPLICATION_CREDENTIALS="/home/python2/key.json"
# credentials = storage.Client.from_service_account_json('/home/python2/key.json')
# credentials = service_account.Credentials.from_service_account_file('key.json')

mydb = mysql.connector.connect(
    host="localhost", user="root", passwd="pass_word", database="test")
mycursor = mydb.cursor(named_tuple=True)

mycursor.execute("SELECT * FROM test")
myresult = mycursor.fetchall()
df = pd.DataFrame(data=myresult)

storage_client = storage.Client()
bucket = storage_client.get_bucket("my-buckets-1234567")
blob = bucket.blob("file.json")

df = pd.DataFrame(data=myresult).to_json(orient='records')
# df = pd.DataFrame(data=myresult).to_csv(sep=";", index=False, quotechar='"', quoting=csv.QUOTE_ALL, encoding="UTF-8")
blob.upload_from_string(data=df)
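One possible extension, sketched under the assumption that every table in the connected schema should become its own JSON blob in the same bucket: list the tables with SHOW TABLES and loop over them. The cursor, DataFrame, and bucket objects come from the script above; the per-table blob naming is illustrative.

# Hedged sketch: export every table in the connected schema as <table>.json.
mycursor.execute("SHOW TABLES")
tables = [row[0] for row in mycursor.fetchall()]

for table in tables:
    mycursor.execute(f"SELECT * FROM `{table}`")
    rows = mycursor.fetchall()
    payload = pd.DataFrame(data=rows).to_json(orient='records')
    # One blob per table, named after the table (illustrative naming).
    bucket.blob(f"{table}.json").upload_from_string(data=payload)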

How to read jsonl.gz file stored in an s3 bucket using Boto3-Python3

I have a few files in my S3 bucket that are stored as .gz files. I am using boto3 to access those files, and I am trying to read their contents.
However, I keep getting this error when I run my code:
OSError: [Errno 9] read() on write-only GzipFile object
Here is my code:
import boto3
import os
import json
from io import BytesIO
import gzip
from gzip import GzipFile
from datetime import datetime
import logging
import botocore
# AWS Bucket Info
BUCKET_NAME = '<my_bucket_name>'
#My bucket's key information to where the .GZ files are stored
key1 = 'my/path/to/file/shar1-jsonl.gz'
# key2 = 'my/path/to/file/shar2-jsonl.gz'
# key3 = 'my/path/to/file/shar3-jsonl.gz'
# Create s3 connection
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
zip_obj = s3_resource.Object(bucket_name=BUCKET_NAME, key=key1)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = gzip.open(buffer,'wb').read().decode(('utf-8'))
Is there any way to collect the jsonl.gz file and then read its contents using boto3? I am new to boto3 and gzip files, so any ideas or suggestions would help.
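For reference, a hedged sketch of reading the object: the OSError comes from opening the GzipFile in write mode ('wb') and then calling read() on it; opening the same buffer in read/text mode and parsing each JSON Lines record would look roughly like this (bucket, key, and credential names reuse the question's).

import json
import gzip
from io import BytesIO

import boto3

# Same connection and object as in the question.
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY,
                             aws_secret_access_key=SECRET_KEY)
zip_obj = s3_resource.Object(bucket_name=BUCKET_NAME, key=key1)
buffer = BytesIO(zip_obj.get()["Body"].read())

# 'rt' opens the gzip stream for reading as text; each line is one JSON record.
with gzip.open(buffer, 'rt', encoding='utf-8') as gz:
    records = [json.loads(line) for line in gz if line.strip()]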

How to upload and read a zip file containing training and testing image data in Google Colab from my PC

I am new to Google Colab. I am implementing pretrained VGG16 and ResNet50 models using PyTorch, but I am unable to load my file and read it, as it returns an error that no directory is found.
I have uploaded the data through the file browser, and I have also tried uploading it using
from google.colab import files
uploaded = files.upload()
The file got uploaded, but when I tried to unzip it (it is a zip file) using
!unzip content/cropped_months
it says
no file found
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.transforms import *
from torch.optim import lr_scheduler
from torch.autograd import Variable
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy

from google.colab import files
uploaded = files.upload()

!unzip content/cropped_months

data_dir = 'content/cropped_months'

# Define transforms for the training data and testing data
train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406],
                                                            [0.229, 0.224, 0.225])])

test_transforms = transforms.Compose([transforms.Resize(256),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406],
                                                           [0.229, 0.224, 0.225])])

# Pass transforms in here
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)

# Data loaders
trainloader = torch.utils.data.DataLoader(train_data, batch_size=8, shuffle=True)
testloader = torch.utils.data.DataLoader(test_data, batch_size=8, shuffle=True)

print("Classes: ")
class_names = train_data.classes
print(class_names)
First error:
unzip: cannot find or open content/cropped_months,
content/cropped_months.zip or content/cropped_months.ZIP.
Second error:
FileNotFoundError                         Traceback (most recent call last)
<ipython-input> in <module>()
     17 #pass transform here-in
---> 18 train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
     19 test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/folder.py in _find_classes(self, dir)
    114         if sys.version_info >= (3, 5):
    115             # Faster and available in Python 3.5 and above
--> 116             classes = [d.name for d in os.scandir(dir) if d.is_dir()]
    117         else:
    118             classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]

FileNotFoundError: [Errno 2] No such file or directory: 'content/cropped_months (1)/train'
You are probably trying to access the wrong path. In my notebook, the file was uploaded to the working directory.
Use google.colab.files to upload the zip.
from google.colab import files
files.upload()
Upload your file. Google Colab will display where it was saved:
Saving dummy.zip to dummy.zip
Then just run !unzip:
!unzip dummy.zip
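Adapting that to the question's file names as a hedged sketch: the upload lands in /content (the notebook's working directory), so unzip there and point data_dir at the absolute path. This assumes the archive is named cropped_months.zip and unpacks to a top-level cropped_months/ folder containing train/ and test/.

# Assumptions: the uploaded archive is cropped_months.zip and it unpacks
# to a top-level cropped_months/ directory containing train/ and test/.
!unzip cropped_months.zip -d /content

data_dir = '/content/cropped_months'
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)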
I think you can use the PySurvival library, which is compatible with Torch. Here is the link:
https://square.github.io/pysurvival/miscellaneous/save_load.html