How to filter a CSV file with a Flume static interceptor before writing it to HDFS - csv

I used the configuration below to filter my CSV file, but it does not filter anything.
I used the static interceptor for the filtering; do I need to use a different interceptor? Please suggest what I need to change in the config file below.
A sample of my CSV file is below.
Id,Name,First,Last,Display,Job,Department,OfficeNo,OfficePh,MobilePh,Fax,Address,City,State,ZIP,Country
1,chris@contoso.com,Chris,Green,Chris Green,IT Manager,Information Technology,123451,123-555-1211,123-555-6641,9821,1 Microsoft way,Redmond,Wa,98052,United States
2,ben@contoso.com,Ben,Andrews,Ben Andrews,IT Manager,Information Technology,123452,123-555-1212,123-555-6642,9822,1 Microsoft way,Redmond,Wa,98052,United States
3,david@contoso.com,David,Longmuir,David Longmuir,IT Manager,Information Technology,123453,123-555-1213,123-555-6643,9823,1 Microsoft way,Redmond,Wa,98052,United States
4,cynthia@contoso.com,Cynthia,Carey,Cynthia Carey,IT Manager,Information Technology,123454,123-555-1214,123-555-6644,9824,1 Microsoft way,Redmond,Wa,98052,United States
My flume-conf.properties file is listed below. I expect records with Id=1 to go through ch1, records with Id=2 to go through ch2, and the other Ids (3, 4) to go through the default channel.
Please help me out with this.
a1.sources=src1
a1.channels=ch1 ch2
a1.sinks=s1 s2
a1.sources.src1.type=exec
a1.sources.src1.command=tail -F /home/manish/TwitterExample/Import_User_Sample_en.csv
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=10000
a1.channels.ch1.transactioncapacity=100
a1.channels.ch2.type = memory
a1.channels.ch2.capacity = 10000
a1.channels.ch2.transactioncapacity = 100
The static interceptor is configured as follows:
a1.sources.src1.interceptors=i1
a1.sources.src1.interceptors.i1.type=static
a1.sources.src1.interceptor.i1.key=Id
a1.sources.src1.interceptor.i1.value=1
a1.sources.src1.interceptor.i1.preserveExisting=false
a1.sources.src1.interceptor=i2
a1.sources.src1.interceptor.i2.type=static
a1.sources.src1.interceptor.i2.key=Id
a1.sources.src1.interceptor.i2.value=2
a1.sources.src1.interceptor.i2.preserveExisting=false
a1.sources.src1.fileHeader=true
a1.sources.src1.selector.type=multiplexing
a1.sources.src1.selector.header=Id
a1.sources.src1.selector.mapping.1=ch1
a1.sources.src1.selector.mapping.2 =ch2
a1.sources.src1.selector.default = ch2
a1.sinks.s1.type=hdfs
a1.sinks.s1.hdfs.path=hdfs://kdp.ambarikdp1.com:8020/user/data/twitter
a1.sinks.s1.hdfs.filetype=DataStream
a1.sinks.s1.hdfs.rollCount=0
a1.sinks.s1.hdfs.rollSize=0
a1.sinks.s1.hdfs.rollInterval=300
a1.sinks.s1.hdfs.serializer=HEADER_AND_TEXT
a1.sinks.s2.type = hdfs
a1.sinks.s2.hdfs.path = hdfs://kdp.ambarikdp1.com:8020/user/data/t2
a1.sinks.s2.hdfs.filetype = DataStream
a1.sinks.s2.hdfs.rollCount = 0
a1.sinks.s2.hdfs.rollSize = 0
a1.sinks.s2.hdfs.rollInterval = 300
a1.sinks.s2.hdfs.serializer=HEADER_AND_TEXT
a1.sources.src1.channels=ch1 ch2
a1.sinks.s1.channel=ch1
a1.sinks.s2.channel = ch2

At the very least, the way you are using multiple interceptors is wrong. Interceptors must be added as a list, this way:
a1.sources.src1.interceptors = i1 i2
Then, you can configure each one of them:
a1.sources.src1.interceptors.i1.type = ...
a1.sources.src1.interceptors.i1.other = ...
...
a1.sources.src1.interceptors.i2.type = ...
a1.sources.src1.interceptors.i2.other = ...
...
That being said, I think the static interceptor is not what you need, since AFAIK it always adds the same static header to all the events. I mean, all the events will have the same header name and value, independently of the "Id" field.
Please have a look at How to use regex_extractor selector and multiplexing interceptor together in flume?, i.e. use the Regex Extractor Interceptor instead.
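As a rough sketch of that approach (untested; the regex and names are assumptions based on the sample CSV above, not a verified config), the regex_extractor interceptor can pull the leading Id field of each line into an "Id" header, which the existing multiplexing selector then routes on:
a1.sources.src1.interceptors = i1
a1.sources.src1.interceptors.i1.type = regex_extractor
# Capture the digits before the first comma (the Id column) into a header named "Id"
a1.sources.src1.interceptors.i1.regex = ^(\\d+),
a1.sources.src1.interceptors.i1.serializers = s1
a1.sources.src1.interceptors.i1.serializers.s1.name = Id
a1.sources.src1.selector.type = multiplexing
a1.sources.src1.selector.header = Id
a1.sources.src1.selector.mapping.1 = ch1
a1.sources.src1.selector.mapping.2 = ch2
a1.sources.src1.selector.default = ch2
Note that the CSV header line ("Id,Name,...") will not match the regex, so it would carry no Id header and be routed to the default channel.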

Related

issue with connecting data in databricks from data lake and reading JSON into Folium

I'm working on something based on this blog post:
https://python-visualization.github.io/folium/quickstart.html#Getting-Started
specifically part 13, using Choropleth maps.
The piece of code they use is the following:
import pandas as pd
import folium

url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)
folium.LayerControl().add_to(m)
m
If I use this, I get the requested map.
Now I try to do this with my own data. I work in Databricks, so I have a JSON file with the GeoJSON data (source_file1) and a CSV file (source_file2) with the data that needs to be plotted on the map.
source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
state_geo = spark.read.json(source_file1,multiLine=True)
source_file2 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/DATASVZ.csv"
df_2 = spark.read.format("CSV").option("inferSchema", "true").option("header", "true").option("delimiter",";").load(source_file2)
state_data = df_2.toPandas()
When I adjust the code as below:
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m
So when I pass the geo_data parameter as a Spark DataFrame, I get the following error:
ValueError: Cannot render objects with any missing geometries: DataFrame[features: array<struct<geometry:struct<coordinates:array<array<array<string>>>,type:string>,properties:struct<arr_fr:string,arr_nis:bigint,arr_nl:string,fill:string,fill-opacity:double,name_fr:string,name_nl:string,nis:bigint,population:bigint,prov_fr:string,prov_nis:bigint,prov_nl:string,reg_fr:string,reg_nis:string,reg_nl:string,stroke:string,stroke-opacity:bigint,stroke-width:bigint>,type:string>>, type: string]
I think this is because, in transforming the data from the "blob format" in the Azure data lake to a Spark DataFrame, something goes wrong with the format. I tested this in a Jupyter notebook on my desktop, reading the data straight from file into folium, and it all works.
If I load it directly from the source, like the example does with their web page, and adjust the geo_data parameter of the folium function:
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=source_file1,  # this points directly to the data lake
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m
I get an error:
Use "/dbfs", not "dbfs:": the function expects a local file path. The error is caused by passing a path prefixed with "dbfs:".
So I started wondering what the difference is between my JSON file and the one from the blog post. The only thing I can imagine is that the Azure data lake doesn't store my JSON as JSON but as a block blob file, and for some reason I am not converting it properly so that folium can read it.
Azure blob storage (data lake)
So can someone with folium knowledge let me know:
A. Is it not possible to load the geo_data directly from a data lake?
B. In what format do I need to upload the data?
Any thoughts on this would be helpful!
Thanks in advance!
Solved this issue: I just had to replace "dbfs:" with "/dbfs". I had tried it a lot of times but used "/dbfs:" and got another error.
Can't believe I'm this stupid :-)
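For reference, a minimal sketch of the working call under that fix (paths and parameters are the ones from the question; the key point is the /dbfs FUSE-mount prefix, which makes the file look local to folium):
# Use the local /dbfs FUSE mount instead of the "dbfs:" URI scheme
source_file1 = "/dbfs/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
folium.Choropleth(
    geo_data=source_file1,  # folium opens this as a plain local file
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)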

unable to load csv from GCS bucket to BigQuery table accurately

I am trying to load the airbnb_nyc data set from a GCS bucket to a BigQuery table. Link to the dataset.
I am using the following code:
def parse_file(element):
    for line in csv.reader([element], delimiter=','):
        return line

class DataIngestion2:
    def parse_method2(self, values):
        row1 = dict(
            zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
                 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
                 'number_of_reviews', 'last_review', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365'),
                values))
        return row1

with beam.Pipeline(options=pipeline_options) as p:
    lines = (p | 'Read' >> ReadFromText(known_args.input, skip_header_lines=1)
               | 'parse' >> beam.Map(parse_file))
    pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
    pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(
        table_spec, schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
    )
But my output in the BigQuery table is wrong: I am only getting values for the first two columns, and the remaining 14 columns show NULL. I am not able to figure out what I am doing wrong. Can someone help me find the error in my logic? I basically want to know how to transfer a CSV from a GCS bucket to BigQuery through a Dataflow pipeline.
Thank you,
You can use the ReadFromText method and then create your own transform by extending beam.DoFn. The code is attached below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about ParDo and DoFn:
https://beam.apache.org/documentation/programming-guide/#pardo
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv

COLUMN_NAMES = ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
                'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
                'number_of_reviews', 'last_review', 'reviews_per_month',
                'calculated_host_listings_count', 'availability_365']

def files(path='gs://some/path'):
    # List every object key under the given prefix.
    return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())

def transform_csv(element):
    # Parse one CSV file into a list of rows, skipping the header line.
    rows = []
    with open(element, newline='\r\n') as f:
        itr = csv.reader(f, delimiter=',', quotechar='"')
        skip_head = next(itr)
        for row in itr:
            rows.append(row)
    return rows

def to_dict(element):
    # Zip each row with the column names to build one dict per row.
    rows = []
    for item in element:
        row_dict = {}
        zipped = zip(COLUMN_NAMES, item)
        for key, val in zipped:
            row_dict[key] = val
        rows.append(row_dict)
    yield rows

with beam.Pipeline() as p:
    read = (
        p
        | 'read-file' >> beam.Create(files())
        | 'transform-dict' >> beam.Map(transform_csv)
        | 'list-to-dict' >> beam.FlatMap(to_dict)
        | 'print' >> beam.Map(print)
        # | 'write-to-bq' >> WriteToBigQuery(schema=COLUMN_NAMES, table='ur table', project='', dataset='')
    )
EDIT 1: ReadFromText supports \r\n as the newline character, but that fails to handle the case where the column data itself contains \r\n. The code above has been updated accordingly.
EDIT 2: GcsIO error fixed.
Note: I have used GcsIO to get the list of files.
Details here
Please up-vote and mark as answer if this helps.
Let me suggest another approach for this use case. BigQuery offers a special feature for loading from Google Cloud Storage (GCS) into BigQuery. You can load data in several formats, and CSV is among them.
There is a nice tutorial in the Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam; such a process is available through the BigQuery API itself.
It works with many languages, but you do not have to use any language at all, as the process can be done from the console or via the Cloud SDK using the bq command. Everything can be found in the mentioned tutorial.
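As a minimal sketch of that route with the Python client library (the bucket, dataset, and table names here are placeholders, not values from the question):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # let BigQuery infer the schema
)
# Load straight from GCS into BigQuery; no Dataflow pipeline involved.
load_job = client.load_table_from_uri(
    "gs://my-bucket/airbnb_nyc.csv",
    "my_project.my_dataset.airbnb_nyc",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish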

AWS Glue Job - CSV to Parquet. How to ignore header?

I need to convert a bunch (23) of CSV files (source: S3) into Parquet format. The input CSVs all contain headers. When I generated code for this using Glue, the output also contained the 22 remaining header rows as separate data rows, which means only the first header was skipped. I need help ignoring all the headers while doing this transformation.
Since I'm using the from_catalog function for my input, I don't have any format_options to ignore the header rows.
Also, can I set an option in the Glue table that says the header is present in the files? Will that automatically skip the header when my job runs?
Part of my current approach is below. I'm new to Glue. This code was actually auto-generated by Glue.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")
I faced the exact same issue while working on an ETL job which used AWS Glue.
The documentation for from_catalog says:
additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.
I tried using the snippet below, and some of its permutations, with from_catalog. But nothing worked for me.
additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},
One way to fix this is to use from_options instead of from_catalog, pointing directly to the S3 bucket or folder. This is what it should look like:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        'paths': ['s3://bucket_name/folder_name'],
        "recurse": True,
        'groupFiles': 'inPartition'
    },
    format="csv",
    format_options={
        "withHeader": True
    },
    transformation_ctx="datasource0"
)
But if you can't do this for any reason and want to stick with from_catalog, using a filter worked for me.
Assuming that one of your headers is named name, this is what the snippet can look like:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")
I'm not very sure how Spark's DataFrames or Glue's DynamicFrames deal with CSV headers, or why data read from the catalog had headers in the rows as well as in the schema, but this seemed to solve my issue by removing the header values from the rows.

Google Drive API v3 update an object if one exists with same name: list 'q' parameter does not work as documented?

I'm trying to update a file if it exists in a particular folder and has a specific name. In this instance, the object in question is in a Team Drive. I followed the documentation to compose the q parameter of the list call, and tried switching back to v2... It seems that the query is composed exactly correctly. That being said, even though I see multiple objects present in the target folder, the list call fails to see them. I've tried name = '' and name contains ''. There seems to be enough input validation put in place by the Google team, as when I get creative the API bombs. Any pointers?
def import_or_replace_csv_to_td_folder(self, folder_id, local_fn, remote_fn, mime_type):
    DRIVE = build('drive', 'v3', http=creds.authorize(Http()))
    query = "'{0}' in parents and name = '{1}'".format(folder_id, remote_fn)
    print("Searching for previous versions of this file : {0}".format(query))
    check_if_already_exists = DRIVE.files().list(q=query, fields="files(id, name)").execute()
    name_and_location_conflict = check_if_already_exists.get('files', [])
    if not name_and_location_conflict:
        body = {'name': remote_fn, 'mimeType': mime_type, 'parents': [folder_id]}
        out = DRIVE.files().create(body=body, media_body=local_fn, supportsTeamDrives=True,
                                   fields='id').execute().get('id')
        return out
    else:
        if len(name_and_location_conflict) == 1:
            file_id = name_and_location_conflict[0]['id']
            DRIVE.files().update(fileId=file_id, supportsTeamDrives=True, media_body=local_fn)
            return file_id
        else:
            raise MultipleConflictsError("There are multiple documents matching parent folder and file name. Unclear which requires a version update")
When I tried to change the name parameter to title (which used to work in v2, based on some answers I reviewed), the API barfed:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://www.googleapis.com/drive/v3/files?q=%27xxxxxxxxxxxxxxxx%27+in+parents+and+title+%3D+%27Somefile_2018-09-27.csv%27&fields=files%28id%2C+name%29&alt=json returned "Invalid Value">
Thanks @tehhowch,
Indeed, extra measures are necessary when the target is in a Team Drive; namely, the includeTeamDriveItems option needs to be set, otherwise TD locations are not included by default:
check_if_already_exists = DRIVE.files().list(
    q=query,
    fields="files(id, name)",
    supportsTeamDrives=True,
    includeTeamDriveItems=True
).execute()

filter a glob-like regex pattern in boto3

Can I use boto3's filter tool to find keys (technically sub-keys) in a bucket, akin to finding files in a directory using glob?
I want to get a list of keys matching a pattern like "key/**/<pattern>/**.gz".
Unfortunately not. S3 provides no server-side support for filtering of results (other than by prefix and delimiter).
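For contrast, a minimal sketch of the one filter S3 does support server-side (the bucket name is a placeholder):
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucketname')
# Prefix matching is the only server-side filter; anything glob-like must be done client-side.
for obj in bucket.objects.filter(Prefix='key/'):
    print(obj.key)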
You can use the exrex library to generate all the strings matching a regex and pass those to boto3 as prefixes. This is a simple example, but you can imagine something a bit more complicated. For example:
import exrex
import boto3

session = boto3.Session()  # profile_name='xyz'
s3 = session.resource('s3')
bucket = s3.Bucket('mybucketname')
prefixes = list(exrex.generate(r'api/v2/responses/2016-11-08/(2016-11-08T2[2-3]|2016-11-09)'))
objects = []
for prefix in prefixes:
    print(prefix, end=" ")
    current_objects = list(bucket.objects.filter(Prefix=prefix))
    print(len(current_objects))
    objects += current_objects
This gives output:
api/v2/responses/2016-11-08/2016-11-08T22 1056
api/v2/responses/2016-11-08/2016-11-08T23 1056
api/v2/responses/2016-11-08/2016-11-09 24677
You can do this by (ab)using the paginator and using .gz as the delimiter. The paginator will return the common prefixes of the keys (in this case everything up to and including the .gz file extension, not including the bucket name, i.e. the entire key), and you can do a regex comparison against those strings.
I won't guess at what your <pattern> is here, and the regex I have provided is a bit rough and ready, but essentially what you want is this:
import boto3
import re

region = 'ap-southeast-2'  ## <- YOUR REGION HERE
s3client = boto3.client('s3', region_name=region)
paginator = s3client.get_paginator('list_objects')
source_bucket = 'MY-BUCKET-NAME'
source_prefix = 'OPTIONAL-PREFIX/NESTED/'
pat = re.compile(r'key\/.+\/<pattern>\/.+\.gz')

for result in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix, Delimiter='.gz'):
    for prefixes in result.get('CommonPrefixes'):
        commonprefix = prefixes.get('Prefix')
        key_path = commonprefix.split('/')
        m = re.search(pat, key_path[2])
        if m is not None:
            print(commonprefix)