AWS Glue Job - CSV to Parquet. How to ignore header?

I need to convert a bunch (23) of CSV files (source: S3) into Parquet format. Every input CSV file contains a header row. When I generated the code for this using Glue, the output contained 22 of the header rows as separate data rows, which means only the first header was ignored. I need help ignoring all the headers during this transformation.
Since I'm using from_catalog function for my input, I don't have any format_options to ignore the header rows.
Also, can I set an option in the Glue table that the header is present in the files? Will that automatically ignore the header when my job runs?
Part of my current approach is below. I'm new to Glue. This code was actually auto-generated by Glue.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")

I faced this exact issue while working on an ETL job that used AWS Glue.
The documentation for from_catalog says:
additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.
I tried using the below snippet and some of its permutations with from_catalog. But nothing worked for me.
additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},
One way to go about fixing this is by using from_options instead of from_catalog and pointing directly to the S3 bucket or folder. This is what it should look like:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://bucket_name/folder_name"],
        "recurse": True,
        "groupFiles": "inPartition"
    },
    format="csv",
    format_options={
        "withHeader": True
    },
    transformation_ctx="datasource0"
)
But if you can't do this for any reason and want to stick with from_catalog, using a filter worked for me.
Assuming that one of your header's name is name, this is what the snippet can look like:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")
I am not sure how Spark's DataFrames or Glue's DynamicFrames deal with CSV headers, or why data read from the catalog had the headers both in the schema and in the rows, but this solved my issue by removing the header values from the rows.
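The filter idea can be tried out without a Glue job. Below is a minimal plain-Python sketch, assuming each record behaves like a dict and that a repeated header row is one whose value in a column equals that column's name. The names `records` and `is_header_row` are hypothetical and are not part of the Glue API.

```python
def is_header_row(row, column="name"):
    """Return True when the row is a repeated CSV header (value == column name)."""
    return row.get(column) == column


# Hypothetical records, as they might look after reading two CSV files
# whose header lines were not skipped.
records = [
    {"name": "name", "city": "city"},
    {"name": "Alice", "city": "Paris"},
    {"name": "name", "city": "city"},
    {"name": "Bob", "city": "Oslo"},
]

# Same predicate as the lambda passed to Filter.apply, just inverted for clarity.
data_rows = [row for row in records if not is_header_row(row)]
print(data_rows)
```

The caveat of this approach is that a genuine data row whose value happens to equal the column name would be dropped too, so it helps to pick a column where that cannot occur.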

Related

unable to load csv from GCS bucket to BigQuery table accurately

I am trying to load the airbnb_nyc data set from GCS bucket to BigqueryTable. Link to the dataset.
I am using the following Code:
def parse_file(element):
    for line in csv.reader([element], delimiter=','):
        return line
class DataIngestion2:
    def parse_method2(self, values):
        row1 = dict(
            zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365'),
                values))
        return row1
with beam.Pipeline(options=pipeline_options) as p:
    lines = p | 'Read' >> ReadFromText(known_args.input, skip_header_lines=1) \
              | 'parse' >> beam.Map(parse_file)
    pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
    pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(table_spec, schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
    )
But my output on BigQuery Table is wrong.
I am only getting values for the first two columns, and the remaining 14 columns are showing NULL. I cannot figure out what I am doing wrong. Can someone help me find the error in my logic? I basically want to know how to load a CSV from a GCS bucket into BigQuery through a Dataflow pipeline.
Thank you,
You can use the ReadFromText method and then create your own transform by extending beam.DoFn. The code is attached below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about ParDo and DoFn:
https://beam.apache.org/documentation/programming-guide/#pardo
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText,ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv
COLUMN_NAMES = ['id','name','host_id','host_name','neighbourhood_group','neighbourhood','latitude','longitude','room_type','price','minimum_nights','number_of_reviews','last_review','reviews_per_month','calculated_host_listings_count','availability_365']
def files(path='gs://some/path'):
    return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())
def transform_csv(element):
    rows = []
    with open(element, newline='\r\n') as f:
        itr = csv.reader(f, delimiter=',', quotechar='"')
        skip_head = next(itr)  # discard the header row
        for row in itr:
            rows.append(row)
    return rows
def to_dict(element):
    rows = []
    for item in element:
        row_dict = {}
        zipped = zip(COLUMN_NAMES, item)
        for key, val in zipped:
            row_dict[key] = val
        rows.append(row_dict)
    yield rows
with beam.Pipeline() as p:
    read = (
        p
        | 'read-file' >> beam.Create(files())
        | 'transform-dict' >> beam.Map(transform_csv)
        | 'list-to-dict' >> beam.FlatMap(to_dict)
        | 'print' >> beam.Map(print)
        # | 'write-to-bq' >> WriteToBigQuery(schema=COLUMN_NAMES, table='ur table', project='', dataset='')
    )
EDIT 1: ReadFromText supports \r\n as the newline character, but it does not handle the case where the column data itself contains \r\n. I have updated the code accordingly.
EDIT 2: GcsIO error fixed.
Note - I have used GCSIO for getting the list of files.
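The point about line breaks inside column data can be checked with the standard library alone. The sketch below uses a made-up two-record CSV (the sample data is an assumption, not taken from the Airbnb file) and compares splitting on line boundaries with letting csv.reader consume the whole stream.

```python
import csv
import io

# Hypothetical two-record CSV: the "name" field of the first record
# contains a line break inside quotes.
raw = 'id,name\r\n1,"Cozy\r\nloft"\r\n2,Studio\r\n'

# Splitting on line boundaries (what a line-oriented reader does) cuts
# the quoted field into two separate fragments.
naive_rows = raw.splitlines()[1:]

# csv.reader over the whole stream keeps the quoted field in one record.
reader = csv.reader(io.StringIO(raw, newline=""))
next(reader)  # skip the header row
parsed_rows = list(reader)

print(len(naive_rows), "fragments when splitting by line")
print(len(parsed_rows), "records when parsed as one CSV stream")
```

This is why reading each file as a whole and parsing it with csv.reader is more robust than parsing line by line when fields may contain embedded newlines.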
Let me suggest another approach for this use case. BigQuery offers a dedicated feature for loading data from Google Cloud Storage (GCS) into BigQuery. You can load data in several formats, and CSV is among them.
There is a nice tutorial in the Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam; the process is available through the BigQuery API itself.
It works from many languages, but you do not need to write any code at all, since the load can also be done from the console or via the Cloud SDK using the bq command. Everything can be found in the tutorial.

Python3 Replacing special character from .csv file after convert the same from JSON

I am trying to develop a program using Python 3.6.4 that converts a JSON file into a CSV file; we also need to clean the data in the CSV file. For example:
My JSON File:
{emp:[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},
{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}
Problem 1:
After converting it to CSV, a blank row appears after every row, like this:
**Name email Des**
[<BLANK ROW>]
Bo#b bob#gmail.com Unknown
[<BLANK ROW>]
Martin mar#tin#gmail.com D#eveloper
Problem 2:
In my code I am using emp but I need to use it dynamically.
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read()
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
emp_data = employee_parsed['employee']
As we will not know the structure or content of up-coming JSON file.
Problem 3:
I also need to remove all # characters from the CSV file.
For solving Problem 3, you can use .replace (https://www.tutorialspoint.com/python/string_replace.htm).
For problem 2, you can use the dictionary keys and then get the zeroth item out of it.
import json
with open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt") as fobj:
    jsonCont = fobj.read().replace("#", "")
print(jsonCont)
employee_parsed = json.loads(jsonCont)
# dict.keys() is not subscriptable in Python 3, so take the first key via an iterator
first_key = next(iter(employee_parsed))
emp_data = employee_parsed[first_key]
I can't solve problem 1 without more code showing how you are exporting the result. It may be that your data has newlines in it, in which case you could add .replace("\n","") and/or .replace("\r","") after the previous replace, so the line would read fobj.read().replace("#", "").replace("\n", "").replace("\r", "").
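For Problem 1, one common cause of a blank row after every record on Windows is writing with the csv module to a file that was opened without newline="". Whether that is the cause here is an assumption, since the export code was not posted. The sketch below combines the three points: it takes the first top-level key dynamically, strips # from every value, and writes through a buffer created with newline="". The function name `json_to_csv` and the sample string are illustrative only.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert {"<any key>": [ {...}, ... ]} to CSV text with '#' removed."""
    parsed = json.loads(json_text)
    # Take whichever top-level key is present instead of hard-coding it.
    first_key = next(iter(parsed))
    records = parsed[first_key]

    # newline="" stops the text layer from translating the csv module's own
    # "\r\n" line endings; pass the same argument to open() for a real file.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    for record in records:
        writer.writerow({k: str(v).replace("#", "") for k, v in record.items()})
    return buffer.getvalue()


sample = (
    '{"emp": [{"Name": "Bo#b", "email": "bob#gmail.com", "Des": "Unknown"},'
    ' {"Name": "Martin", "email": "mar#tin#gmail.com", "Des": "D#eveloper"}]}'
)
csv_text = json_to_csv(sample)
print(csv_text)
```

For a file on disk, the equivalent would be open(path, "w", newline="") passed to csv.DictWriter. Cleaning the parsed values rather than the raw JSON text also avoids touching # characters in keys.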

Confusion when uploading a JSON from googlecloud storage to bigquery

Hello this is a 2 part question
1) Currently I am trying to upload a file from google cloud storage to bigquery via a python script. I am trying to follow the steps given by the google help site.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#bigquery-import-gcs-file-python
def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(
        job_name, table, source)
    job.begin()
    wait_for_job(job)
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
I am not sure what to pass as the arguments in the first line of "load_data_from_gcs", because there are no tables in Google Cloud Storage; what I am trying to upload is a JSON file. Is the "table" part the name of the table I am trying to create, or does it refer to the bucket? I ask because there is no parameter to specify which bucket I want to pull from.
This is the code I have so far.
import json
import argparse
import time
import uuid
from google.cloud import bigquery
# from google.cloud import storage
def load_data_from_gcs('dataworks-356fa', table_name, 'pullnupload.json'):
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(
        job_name, table, source)
    job.begin()
    wait_for_job(job)
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
part 2)
I want this script to run weekly and either delete the old table and create a new one, or load only the non-duplicated data, whichever is easier.
Thank you for your help.
I'm not sure what problem you are having, but loading data from a file in GCS into BigQuery works exactly the way you are already doing it.
If you have a table with this schema:
[{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]
And if you have this file in GCS (located for instance at "gs://bucket/json_data.json"):
{"id": 1, "name": "test1"}
{"id": 2, "name": "test2"}
You'd just need now to set the job object to process a JSON file as input, like so:
def load_data_from_gcs(table_name, source="gs://bucket/json_data.json"):
    # Parameters must be names, not literals; the GCS URI is passed as `source`.
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(
        job_name, table, source)
    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()
And that's it.
(If you have a CSV file then you have to set your job object accordingly).
As for the second question, it's really a matter of trying out different approaches and seeing which works best for you.
To delete a table, you'd just need to run:
table.delete()
To remove duplicated data from a table one possibility would be to write a query that removes the duplication and saves the results to the same table. Something like:
query_job = bigquery_client.run_async_query(query=your_query, job_name=job_name)
query_job.destination = table  # the Table object to overwrite
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.begin()

Save content of Spark DataFrame as a single CSV file [duplicate]

Say I have a Spark DataFrame which I want to save as CSV file. After Spark 2.0.0 , DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the path provided.
How would I save a DF with :
Path mapping to the exact file name instead of folder
Header available in first line
Save as a single file instead of multiple files.
One way to deal with it, is to coalesce the DF and then save the file.
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However, this has the disadvantage of pulling all the data into a single partition on one machine, which must have enough memory to hold it.
Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the above code?
Just solved this myself using pyspark with dbutils to get the .csv and rename to the wanted filename.
save_location= "s3a://landing-bucket-test/export/"+year
csv_location = save_location+"temp.folder"
file_location = save_location+'export.csv'
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)
This answer can be improved by not using [-1], but the .csv seems to always be last in the folder. Simple and fast solution if you only work on smaller files and can use repartition(1) or coalesce(1).
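One way to avoid relying on [-1] is to select the part file by name. This is a minimal sketch, assuming Spark names its output part-*.csv and that the temporary folder holds exactly one such file. `pick_part_file` and the sample listing are hypothetical.

```python
def pick_part_file(paths):
    """Return the single Spark part file among the listed paths."""
    candidates = [
        p for p in paths
        if p.rsplit("/", 1)[-1].startswith("part-") and p.endswith(".csv")
    ]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one part file, found {len(candidates)}")
    return candidates[0]


# Hypothetical listing of the temporary output folder.
listing = [
    "s3a://landing-bucket-test/export/2020temp.folder/_SUCCESS",
    "s3a://landing-bucket-test/export/2020temp.folder/_committed_123",
    "s3a://landing-bucket-test/export/2020temp.folder/part-00000-abc.csv",
]
print(pick_part_file(listing))
```

In the snippet above, this would be fed with [f.path for f in dbutils.fs.ls(csv_location)]. Failing loudly when more than one part file exists also catches leftovers from an earlier run written with mode="append".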
Use:
df.toPandas().to_csv("sample_file.csv", header=True)
See documentation for details:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas
df.coalesce(1).write.option("inferSchema","true").csv("/newFolder", header='true', dateFormat="yyyy-MM-dd HH:mm:ss")
The following scala method works in local or client mode, and writes the df to a single csv of the chosen name. It requires that the df fit into memory, otherwise collect() will blow up.
import org.apache.hadoop.fs.{FileSystem, Path}
val SPARK_WRITE_LOCATION = some_directory
val SPARKSESSION = org.apache.spark.sql.SparkSession
def saveResults(results : DataFrame, filename: String) {
    var fs = FileSystem.get(this.SPARKSESSION.sparkContext.hadoopConfiguration)
    if (SPARKSESSION.conf.get("spark.master").toString.contains("local")) {
        fs = FileSystem.getLocal(new conf.Configuration())
    }
    val tempWritePath = new Path(SPARK_WRITE_LOCATION)
    if (fs.exists(tempWritePath)) {
        val x = fs.delete(new Path(SPARK_WRITE_LOCATION), true)
        assert(x)
    }
    if (results.count > 0) {
        val hadoopFilepath = new Path(SPARK_WRITE_LOCATION, filename)
        val writeStream = fs.create(hadoopFilepath, true)
        val bw = new BufferedWriter( new OutputStreamWriter( writeStream, "UTF-8" ) )
        val x = results.collect()
        for (row : Row <- x) {
            val rowString = row.mkString(start = "", sep = ",", end="\n")
            bw.write(rowString)
        }
        bw.close()
        writeStream.close()
        val resultsWritePath = new Path(WRITE_DIRECTORY, filename)
        if (fs.exists(resultsWritePath)) {
            fs.delete(resultsWritePath, true)
        }
        fs.copyToLocalFile(false, hadoopFilepath, resultsWritePath, true)
    } else {
        System.exit(-1)
    }
}
This solution is based on a Shell Script and is not parallelized, but is still very fast, especially on SSDs. It uses cat and output redirection on Unix systems. Suppose that the CSV directory containing partitions is located on /my/csv/dir and that the output file is /my/csv/output.csv:
#!/bin/bash
echo "col1,col2,col3" > /my/csv/output.csv
for i in /my/csv/dir/*.csv ; do
    echo "Processing $i"
    # quote the variable so paths containing spaces are handled correctly
    cat "$i" >> /my/csv/output.csv
    rm "$i"
done
echo "Done"
It removes each partition after appending it to the final CSV in order to free space.
"col1,col2,col3" is the CSV header (here we have three columns named col1, col2 and col3). You must tell Spark not to put the header in each partition (this is accomplished with .option("header", "false")), because the shell script adds it.
For those still wanting to do this here's how I got it done using spark 2.1 in scala with some java.nio.file help.
Based on https://fullstackml.com/how-to-export-data-frame-from-apache-spark-3215274ee9d6
val df: org.apache.spark.sql.DataFrame = ??? // data frame to write
val file: java.nio.file.Path = ??? // target output file (i.e. 'out.csv')
import scala.collection.JavaConversions._
// write csv into temp directory which contains the additional spark output files
// could use Files.createTempDirectory instead
val tempDir = file.getParent.resolve(file.getFileName + "_tmp")
df.coalesce(1)
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save(tempDir.toAbsolutePath.toString)
// find the actual csv file
val tmpCsvFile = Files.walk(tempDir, 1).iterator().toSeq.find { p =>
    val fname = p.getFileName.toString
    fname.startsWith("part-00000") && fname.endsWith(".csv") && Files.isRegularFile(p)
}.get
// move to desired final path
Files.move(tmpCsvFile, file)
// delete temp directory
Files.walk(tempDir)
    .sorted(java.util.Comparator.reverseOrder())
    .iterator().toSeq
    .foreach(Files.delete(_))
The FileUtil.copyMerge() from the Hadoop API should solve your problem.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
    val hadoopConfig = new Configuration()
    val hdfs = FileSystem.get(hadoopConfig)
    FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
    // the "true" setting deletes the source files once they are merged into the new output
}
See Write single CSV file using spark-csv
This is how distributed computing works! Multiple files inside a directory are exactly what a distributed job produces, and this is not a problem in itself, since software that reads Spark output can handle such a directory.
Your question should really be "how can I download a CSV composed of multiple files?", and there are already a lot of solutions for that on SO.
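One such solution is to concatenate the part files after they have been downloaded. This is a minimal sketch, assuming the parts sit in a local directory and were written without headers. `merge_parts`, the file names and the column names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def merge_parts(parts_dir, output_file, header):
    """Concatenate header-less part files into one CSV with a single header."""
    parts = sorted(Path(parts_dir).glob("*.csv"))
    with open(output_file, "w", newline="") as out:
        out.write(header + "\n")
        for part in parts:
            with open(part, "r", newline="") as src:
                shutil.copyfileobj(src, out)
            part.unlink()  # delete each part once it has been appended


# Small demonstration with two hypothetical part files in a temp directory.
work = Path(tempfile.mkdtemp())
parts_dir = work / "dir"
parts_dir.mkdir()
(parts_dir / "part-00000.csv").write_text("1,2,3\n")
(parts_dir / "part-00001.csv").write_text("4,5,6\n")

output = work / "output.csv"
merge_parts(parts_dir, output, "col1,col2,col3")
merged = output.read_text()
print(merged)
```

Because the parts are streamed one at a time, the merged file never has to fit in memory. The output file is kept outside the parts directory so that it is not picked up by the glob.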
Another approach could be to use Spark as a JDBC source (with the awesome Spark Thrift server), write a SQL query and transform the result to CSV.
In order to prevent OOM in the driver (since the driver will get ALL the data), use incremental collect (spark.sql.thriftServer.incrementalCollect=true); more info at http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/.
Small recap about Spark "data partition" concept:
INPUT (X PARTITIONs) -> COMPUTING (Y PARTITIONs) -> OUTPUT (Z PARTITIONs)
Between "stages", data can be transferred between partitions; this is the "shuffle". You want Z = 1 while Y > 1, without a shuffle? That is impossible.

filter a glob-like regex pattern in boto3

Can I use boto3's filter tool for finding keys (technically sub-keys) in a bucket akin to files in a directory using glob?
I want to get a list of keys with a pattern like this "key/**/<pattern>/**.gz".
Unfortunately not. S3 provides no server-side support for filtering of results (other than by prefix and delimiter).
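Since only prefix filtering happens on the server, the usual workaround is to list by the longest literal prefix and match the rest on the client. The sketch below shows that idea with the standard fnmatch module on a hard-coded list of keys. `literal_prefix`, `filter_keys` and the sample keys are hypothetical, and no S3 call is made.

```python
import fnmatch


def literal_prefix(pattern):
    """Longest leading part of the glob that contains no wildcard characters."""
    for index, char in enumerate(pattern):
        if char in "*?[":
            return pattern[:index]
    return pattern


def filter_keys(keys, pattern):
    """Keep keys matching the glob; note that '*' also matches '/' here."""
    return [key for key in keys if fnmatch.fnmatchcase(key, pattern)]


# Hypothetical keys, standing in for what a listing of the bucket returns.
keys = [
    "key/2016/a/logs/part1.gz",
    "key/2017/logs/part2.gz",
    "key/2017/logs/readme.txt",
    "other/2017/logs/part3.gz",
]
pattern = "key/*/logs/*.gz"

print(literal_prefix(pattern))
print(filter_keys(keys, pattern))
```

With boto3, the value of literal_prefix(pattern) would be passed as Prefix to bucket.objects.filter, and filter_keys would be applied to the key of each returned object. Every object under the prefix is still listed, so this saves no requests compared with a plain listing.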
You can use the exrex library to generate all strings that match a regex and pass them to boto3 as prefixes. This is a simple example, but you can imagine something a bit more complicated:
import exrex
import boto3
session = boto3.Session() # profile_name='xyz'
s3 = session.resource('s3')
bucket = s3.Bucket('mybucketname')
prefixes = list(exrex.generate(r'api/v2/responses/2016-11-08/(2016-11-08T2[2-3]|2016-11-09)'))
objects = []
for prefix in prefixes:
    print(prefix, end=" ")
    current_objects = list(bucket.objects.filter(Prefix=prefix))
    print(len(current_objects))
    objects += current_objects
This gives output:
api/v2/responses/2016-11-08/2016-11-08T22 1056
api/v2/responses/2016-11-08/2016-11-08T23 1056
api/v2/responses/2016-11-08/2016-11-09 24677
You can do this by (ab)using the paginator and using .gz as the delimiter. Paginator will return the common prefixes of the keys (in this case everything including the .gz file extension not including the bucket name, i.e. the entire Key) and you can do some regex compare against those strings.
I am not guessing at what your <pattern> is here, and the regex I have provided is a bit rough and ready but essentially what you want is this.
import boto3
import re
region = 'ap-southeast-2' ## <- YOUR REGION HERE
s3client = boto3.client('s3', region_name=region)
paginator = s3client.get_paginator('list_objects')
source_bucket = 'MY-BUCKET-NAME'
source_prefix = 'OPTIONAL-PREFIX/NESTED/'
pat = re.compile(r'key/.+/<pattern>/.+\.gz')
for result in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix, Delimiter='.gz'):
    # CommonPrefixes is absent when a page has no matching keys
    for prefixes in result.get('CommonPrefixes', []):
        commonprefix = prefixes.get('Prefix')
        # match against the whole key, since the pattern spans several path segments
        if pat.search(commonprefix) is not None:
            print(commonprefix)