Converting a CSV file to JSON in Flume

I am trying to pass a CSV file from Flume to Kafka. Using the following config file, I am able to pass the entire file from Flume to Kafka directly.
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe the source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /User/Desktop/logFile.csv
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = kafkaTopic
a1.sinks.k1.brokerList = localhost:9092
a1.sinks.k1.batchSize = 20
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
But I want it converted to JSON format before it is passed to Kafka for further processing. Can someone please advise me on how to convert the file from CSV to JSON format?
Thanks!!

I think you need to write your own interceptor.
Start by implementing the Interceptor interface:
Read the CSV record from the Flume event body.
Parse it and compose the JSON (see the sketch after the example link below).
Put the result back into the event body.
Example: https://questforthought.wordpress.com/2014/01/13/using-flume-interceptor-multiplexing/
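The interceptor itself is written in Java against Flume's Interceptor interface, but the per-event transformation is straightforward. Below is a minimal Python sketch of just that CSV-to-JSON step (the column names are hypothetical placeholders for whatever logFile.csv actually contains), to illustrate what the interceptor body would do with each event:
import csv
import json
from io import StringIO

def csv_line_to_json(line, header):
    # Parse one CSV record (the Flume event body) and emit it as a JSON string.
    row = next(csv.reader(StringIO(line)))
    return json.dumps(dict(zip(header, row)))

# Hypothetical header; replace with the real column names of logFile.csv.
header = ["timestamp", "level", "message"]
print(csv_line_to_json("2016-01-01 10:00:00,INFO,started", header))
# -> {"timestamp": "2016-01-01 10:00:00", "level": "INFO", "message": "started"}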

Related

Convert a JSON export file to BigQuery newline-delimited JSON

I have a JSON export from a database and I'd like to upload it and create a new table on BQ. The file is 600 MB, and I tried using jq in the Mac terminal, but I'm a noob and couldn't get it to work.
Is there any way to convert an arbitrary JSON file into this newline-delimited JSON format? If yes, please help me with this.
You can load your JSON into Cloud Storage by following this documentation.
Then, with this Python code, you can load the file previously stored in a bucket into BigQuery.
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.json"

load_job = client.load_table_from_uri(
    uri,
    table_id,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

How to parse 2 JSON files in Apache Beam

I have 2 JSON configuration files to read and want to assign their values to variables. I am creating a Dataflow job using Apache Beam, but I am unable to parse those files and assign their values to variables.
config1.json - { "bucket_name": "mybucket"}
config2.json - { "dataset_name": "mydataset"}
These are the pipeline statements. I tried with one JSON file first, but even that is not working:
with beam.Pipeline(options=pipeline_options) as pipeline:
    steps = (pipeline
             | "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
             | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser(custom_options.configfile))
             | "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
             )
    result = pipeline.run()
    result.wait_until_finish()
I also tried creating a function to parse at least one file. This is a sample method I created, but it did not work:
import json
import logging

import apache_beam as beam
from apache_beam.io.gcp import gcsio

class custom_json_parser(beam.DoFn):
    def __init__(self, configfile):
        self.configfile = configfile

    def process(self, configfile):
        logging.info("JSON PARSING STARTED")
        with beam.io.gcp.gcsio.GcsIO().open(self.configfile, 'r') as f:
            for line in f:
                data = json.loads(line)
                bucket = data.get('bucket_name')
                dataset = data.get('dataset_name')
Can someone please suggest the best way to resolve this issue in Apache Beam?
Thanks in advance!
If you only need to read your files once, don't read them in the pipeline; read them before running it:
Read the files from GCS.
Parse the files and put the useful content in the pipeline options map.
Run your pipeline and use the data from the options.
EDIT 1
You can use this piece of code to load and read the file before running your pipeline. Plain Python, standard GCS libraries.
from google.cloud import storage
import json
client = storage.Client()
bucket = client.get_bucket('your-bucket')
blob = bucket.get_blob("name.json")
json_data = blob.download_as_string().decode('UTF-8')
print(json_data) # print -> {"name": "works!!"}
print(json.loads(json_data)["name"]) # print -> works!!
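To make the parsed values available inside the pipeline, one way (a sketch, not the only option) is a custom PipelineOptions subclass whose fields you set before building the pipeline; the option names below simply mirror the two config files:
import json
from apache_beam.options.pipeline_options import PipelineOptions

class ConfigOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Custom options the rest of the pipeline can read.
        parser.add_argument('--bucket_name', default=None)
        parser.add_argument('--dataset_name', default=None)

# In practice these strings would come from the GCS reads shown above.
config1 = json.loads('{"bucket_name": "mybucket"}')
config2 = json.loads('{"dataset_name": "mydataset"}')

pipeline_options = PipelineOptions()
custom = pipeline_options.view_as(ConfigOptions)
custom.bucket_name = config1['bucket_name']
custom.dataset_name = config2['dataset_name']
# Build the pipeline with pipeline_options and use custom.bucket_name / custom.dataset_name where needed.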
You can try the following code snippet.
Function to parse the file:
class custom_json_parser(beam.DoFn):
    def process(self, element):
        logging.info(element)
        data = json.loads(element)
        bucket = data.get('bucket_name')
        dataset = data.get('dataset_name')
        return [{"bucket": bucket, "dataset": dataset}]
In the pipeline, you can call the function:
with beam.Pipeline(options=pipeline_options) as pipeline:
    steps = (pipeline
             | "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
             | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser())
             | "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
             )
    result = pipeline.run()
    result.wait_until_finish()
It will work.

How to work with JSON in Python

My Python script gives JSON output. How can I see it in proper JSON format?
I tried parsing with json.dumps() and json.loads(), but could not achieve the desired result.
======= Myscript.py ========
import sys
import jenkins
import json
import credentials
# Credentials
username = credentials.login['username']
password = credentials.login['password']
# Connect to the Jenkins server
server = jenkins.Jenkins('http://localhost:8080', username=username, password=password)
# Get the installed Plugin info
plugins = server.get_plugins_info()
#parsed = json.loads(plugins) # json.loads() takes a string as input and returns a dictionary.
parsed = json.dumps(plugins) # json.dumps() takes a dictionary as input and returns a string.
#print(json.dumps(parsed, indent=4, sort_keys=True))
print(plugins)
print(parsed)
It sounds like you want to pretty-print your JSON. You would need to pass the correct parameters to json.dumps():
parsed = json.dumps(plugins, sort_keys=True, indent=4)
Check and see if that is what you are looking for.
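For example, with a small stand-in dictionary (hypothetical here, but server.get_plugins_info() also returns a plain dict), the difference looks like this:
import json

# Hypothetical stand-in for the dict returned by server.get_plugins_info().
plugins = {"mailer": {"version": "1.32"}, "git": {"version": "4.8.2"}}

print(plugins)                                        # plain Python repr, single quotes
print(json.dumps(plugins, sort_keys=True, indent=4))  # valid, pretty-printed JSON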

Python - Call Variable within Variable in a Function

I'm reading a JSON file and extracting certain data from it. One of my variables reads a global envName value and sets fi_var equal to it. I would like the next variables in my function to use fi_var as the key, since fi_var is set to the correct FI. That way I don't have to hard-code the FI for each variable. There are other areas where I could benefit from this capability as well; if I can get it to work once, I can repeat the pattern. I'm new to Python, so please excuse me if I don't use the correct terminology.
EXAMPLE.
with open('F5EnvRules.json') as data_file:
    data = json.load(data_file)

def prodwebapp():
    fi_var = data["GLOBAL"]["Prod - envName"]  # fi_var = the FI after reading the JSON file
    fi_www_node_port_var = data["FI"]["portNumber"]  # Want to replace "FI" with fi_var
    fi_www_node_name = data["FI"]["nodeIP_1"]  # Same here
    fi_web_snat_var = data["FI"]["snatIP"]  # Same here

prodwebapp()
Thoughts?
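Since json.load() returns ordinary Python dictionaries, a variable can be used as a key exactly like a literal string. A minimal sketch, assuming the value stored in fi_var matches a top-level key in the file:
def prodwebapp():
    fi_var = data["GLOBAL"]["Prod - envName"]  # e.g. the FI name stored in the file
    fi_section = data[fi_var]                  # look the section up by the variable's value
    fi_www_node_port_var = fi_section["portNumber"]
    fi_www_node_name = fi_section["nodeIP_1"]
    fi_web_snat_var = fi_section["snatIP"]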

Import Kaggle csv from download url to pandas DataFrame

I've been trying different methods to import the SpaceX missions CSV file from Kaggle directly into a pandas DataFrame, without any success.
I need to send requests to log in. This is what I have so far:
import requests
import pandas as pd
from io import StringIO
# Link to the Kaggle data set & name of zip file
login_url = 'http://www.kaggle.com/account/login?ReturnUrl=/spacex/spacex-missions/downloads/database.csv'
# Kaggle Username and Password
kaggle_info = {'UserName': "user", 'Password': "pwd"}
# Login to Kaggle and retrieve the data.
r = requests.post(login_url, data=kaggle_info, stream=True)
df = pd.read_csv(StringIO(r.text))
r is returning the html content of the page.
df = pd.read_csv(url) gives a CParser error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 6
I've searched for a solution, but so far nothing I've tried worked.
You are creating a stream and passing it directly to pandas. I think you need to pass a file-like object to pandas. Take a look at this answer for a possible solution (using POST and not GET in the request, though).
Also, I think the login URL with the redirect that you use is not working as is. I know I suggested it here, but I ended up not using it because the POST request did not handle the redirect (I suspect).
The code I ended up using in my project was this:
def from_kaggle(data_sets, competition):
    """Fetches data from Kaggle

    Parameters
    ----------
    data_sets : (array)
        list of dataset filenames on kaggle. (e.g. train.csv.zip)
    competition : (string)
        name of kaggle competition as it appears in url
        (e.g. 'rossmann-store-sales')
    """
    kaggle_dataset_url = "https://www.kaggle.com/c/{}/download/".format(competition)
    KAGGLE_INFO = {'UserName': config.kaggle_username,
                   'Password': config.kaggle_password}
    for data_set in data_sets:
        data_url = path.join(kaggle_dataset_url, data_set)
        data_output = path.join(config.raw_data_dir, data_set)
        # Attempts to download the CSV file. Gets rejected because we are not logged in.
        r = requests.get(data_url)
        # Login to Kaggle and retrieve the data.
        r = requests.post(r.url, data=KAGGLE_INFO, stream=True)
        # Writes the data to a local file one chunk at a time.
        with open(data_output, 'wb') as f:
            # Reads 512KB at a time into memory
            for chunk in r.iter_content(chunk_size=(512 * 1024)):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
Example use:
sets = ['train.csv.zip',
        'test.csv.zip',
        'store.csv.zip',
        'sample_submission.csv.zip']
from_kaggle(sets, 'rossmann-store-sales')
You might need to unzip the files.
import zipfile

def _unzip_folder(destination):
    """Unzip without regard to the folder structure.

    Parameters
    ----------
    destination : (str)
        Local path and filename where the file should be stored.
    """
    with zipfile.ZipFile(destination, "r") as z:
        z.extractall(config.raw_data_dir)
So I never really loaded it directly into the DataFrame, but rather stored it to disk first. You could, however, modify it to use a temp directory and just delete the files after you read them.
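For example, once a file such as train.csv.zip is on disk, pandas can read the zipped CSV straight into a DataFrame without unpacking it first (the path below is hypothetical):
import pandas as pd

# Works when the zip archive contains a single CSV file.
df = pd.read_csv('data/raw/train.csv.zip', compression='zip')
print(df.head())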