JSON Parsing in Apache Pig

I have the following JSON:
{"Name":"sampling","elementInfo":{"fraction":"3"},"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}
I found that we are able to load JSON into a Pig script:
A = LOAD 'data.json'
USING PigJsonLoader();
But how do I parse the JSON in Apache Pig?
--Sampling.pig
--pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'
--Load data
inputdata = LOAD '$input' using PigStorage('$delimiter');
--Group data
groupedByAll = group inputdata all;
--Sample data
sampled = SAMPLE inputdata $fraction;
--Store output into HDFS
store sampled into '$output' using PigStorage('$delimiter');
Above is my Pig script.
How do I parse the JSON (each element) in Apache Pig?
I need to take the JSON above as input, parse out its source, delimiter, fraction, and output values, and pass them in as $input, $delimiter, $fraction, and $output respectively.
Please suggest how to do this.

Try this:
--Load data
inputdata = LOAD '/input.txt' using JsonLoader('Name:chararray,elementInfo:(fraction:chararray),destination:chararray,source:chararray');
--Group data
groupedByAll = group inputdata all;
store groupedByAll into '/OUT/pig' using PigStorage(',');
Now your output looks like this:
all,{(sampling1,(4),/user/sree/OUT1,/user/sree/foo1.txt),(sampling,(3),/user/sree/OUT,/user/sree/foo.txt)}
In the input file, the fraction value {"fraction":"3"} is in double quotes, so I loaded fraction as a chararray. A chararray can't be used to run the SAMPLE command, which is why I used the script above to get the result.
If you want to perform the sample operation, cast the fraction value to int and then you will get the result.
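If the end goal is to feed those values back into Sampling.pig as -param arguments, one option is a small driver outside Pig. Here is a minimal sketch in Python: the field names come from the JSON in the question, the invocation mirrors the comment at the top of Sampling.pig, and since the JSON carries no delimiter field, the delimiter (and the local path data.json) are assumptions.
import json
import subprocess

# Field names taken from the JSON shown in the question.
with open("data.json") as f:
    conf = json.load(f)

# Mirrors the pig invocation in the comment at the top of Sampling.pig.
subprocess.run([
    "pig", "-x", "mapreduce", "-f", "Sampling.pig",
    "-param", "input=" + conf["source"],
    "-param", "output=" + conf["destination"],
    "-param", "fraction=" + conf["elementInfo"]["fraction"],
    "-param", "delimiter=,",  # assumed: the JSON has no delimiter field
], check=True)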

Related

Unable to parse nested json in logstash

My application generates the mulogs below, which are nested JSON logs. When I try to parse them in Kibana, the JSON parse fails. Below is my log sample:
2022-08-04T12:43:03.977Z {"tags":
{"caller":"sphe",
"job-id":"1",
"mulog/duration":"3180930",
"mulog/namespace":"tool.utilities.db",
"mulog/outcome":":ok",
"user-name":"Pol",
"type":":execute!",
"app-name":"kan",
"mulog/parent-trace":"_YznrMCc",
"user-id":"52-7d4128fb7cb7",
"sql":"SELECT data FROM kan.material_estimate_history WHERE job_id = '167aa1cc' ",
"result":"[]",
"within-tx":"false",
"mulog/root-trace":"S0yn8jclLsmNVyKpH",
"mulog/timestamp":"1659616983977",
"uri":"/api/kan/material-estimates/job/f14b167aa1cc",
"mulog/trace-id":"kI4grnAMe4bGmFc_aX",
"request-method":":get",
"mulog/event-name":":kan=.source.db.material-estimates/find-history-by-job-id"},
"localEndpoint":{"serviceName":"kan"},
"name":"kan.source.db.material-estimates/find-history-by-job-id",
"traceId":"112721d07ecc9be",
"duration":3180,"id":"c90a259a2",
"kind":"SERVER","timestamp":1659616983977000,
"parentId":"dd7368"}

Access XCom in S3ToSnowflakeOperator of Airflow

My use case is: I have an S3 event which triggers a Lambda (upon an S3 createObject event), which in turn invokes an Airflow DAG, passing in a couple of --conf values (bucketname, filekey).
I then extract the key value using a Python operator and store it in an XCom variable. I then want to extract this XCom value within an S3ToSnowflakeOperator and essentially load the file into a Snowflake table.
All parts of the process are working bar the extraction of the XCom value within the S3ToSnowflakeOperator task. I basically get the following in my logs:
query: [COPY INTO "raw".SOURCE_PARAMS_JSON FROM #MYSTAGE_PARAMS_DEMO/ files=('{{ ti.xcom...]
which looks like the jinja template is not correctly resolving the xcom value.
My code is as follows:
from airflow import DAG
from airflow.utils import timezone
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

FILEPATH = "demo/tues-29-03-2022-6.json"

args = {
    'start_date': timezone.utcnow(),
    'owner': 'airflow',
}

with DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
    catchup=False,
    tags=['params demo'],
) as dag:

    def run_this_func(**kwargs):
        outkey = '{}'.format(kwargs['dag_run'].conf['key'])
        print(outkey)
        ti = kwargs['ti']
        ti.xcom_push(key='FILE_PATH', value=outkey)

    run_this = PythonOperator(
        task_id='run_this',
        python_callable=run_this_func
    )

    get_param_val = BashOperator(
        task_id='get_param_val',
        bash_command='echo "{{ ti.xcom_pull(key="FILE_PATH") }}"',
        dag=dag)

    copy_into_table = S3ToSnowflakeOperator(
        task_id='copy_into_table',
        s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"],
        snowflake_conn_id=SNOWFLAKE_CONN_ID,
        stage=SNOWFLAKE_STAGE,
        schema="""\"{0}\"""".format(SNOWFLAKE_RAW_SCHEMA),
        table=SNOWFLAKE_RAW_TABLE,
        file_format="(type = 'JSON')",
        dag=dag,
    )

    run_this >> get_param_val >> copy_into_table
If I replace
s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"],
with
s3_keys=[FILEPATH]
My operator works fine and the data is loaded into Snowflake. So the error is centered on resolving s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"], I believe?
Any guidance/help would be appreciated. I am using Airflow 2.2.2.
I removed the S3ToSnowflakeOperator and replaced it with the SnowflakeOperator.
I was then able to reference the XCom value (as above) for the sql param value.
Note: my XCom value was a derived COPY INTO statement, effectively replicating the functionality of the S3ToSnowflakeOperator, with the added advantage of being able to store the file metadata within the table columns too.
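For reference, a minimal sketch of that workaround, assuming the derived COPY INTO text is pushed to XCom under an illustrative key; SnowflakeOperator's sql argument is templated, so the pull is rendered at runtime:
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Hypothetical: the upstream PythonOperator pushes the derived statement, e.g.
# ti.xcom_push(key='COPY_STATEMENT', value="COPY INTO ... FROM @MYSTAGE_PARAMS_DEMO/ ...")

copy_into_table = SnowflakeOperator(
    task_id='copy_into_table',
    snowflake_conn_id=SNOWFLAKE_CONN_ID,
    sql="{{ ti.xcom_pull(key='COPY_STATEMENT') }}",  # rendered at runtime
)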

Add a new line in front of each line before writing to JSON format using Spark in Scala

I'd like to add one new line in front of each of my JSON documents before Spark writes them to my S3 bucket:
df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select LAST_MODIFIED_BY, LAST_MODIFIED_DATE, NVL(CLASS_NAME, className) as CLASS_NAME, DECISION, TASK_TYPE_ID from ParquetTable")
parkSQL.show(false)
parkSQL.count()
parkSQL.write.json("s3://test-bucket/json-output-7/")
With only this command, it'll produce files with the contents below:
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
But what I'd like to achieve is something like this:
{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
Any insight on how to achieve this result would be greatly appreciated!
The code below concatenates {"index":{}} with the existing row data in the DataFrame: it converts each row to JSON and then saves the data using the text format.
import org.apache.spark.sql.functions.{concat_ws, lit, struct, to_json}
import spark.implicits._ // for the $"" column syntax

df
  .select(
    lit("""{"index":{}}""").as("index"),
    to_json(struct($"*")).as("json_data")
  )
  .select(
    concat_ws(
      "\n", // This will split the index column & the other column data into two lines.
      $"index",
      $"json_data"
    ).as("data")
  )
  .write
  .format("text") // This is required: write.json would re-encode the combined string as a JSON value.
  .save("s3://test-bucket/json-output-7/")
Final Output
cat part-00000-24619b28-6501-4763-b3de-1a2f72a5a4ec-c000.txt
{"index":{}}
{"CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}

Dump a list into a JSON file acceptable by Athena

I am creating a JSON file in an S3 bucket using the following code:
def myconverter(o):
    if isinstance(o, datetime.datetime):
        return o.__str__()

s3.put_object(
    Bucket='sample-bucket',
    Key="sample.json",
    Body=json.dumps(whole_file, default=myconverter)
)
Here, the whole_file variable is a list.
A sample of the "whole_file" variable:
[{"sample_column1": "abcd","sample_column2": "efgh"},{"sample_column1": "ijkl","sample_column2": "mnop"}]
The output "sample.json" file that I get should be in the following format -
{"sample_column1": "abcd","sample_column2": "efgh"}
{"sample_column1": "ijkl","sample_column2": "mnop"}
The output "sample.json" that I am getting is -
[{"sample_column1": "abcd","sample_column2": "efgh"},{"sample_column1": "ijkl","sample_column2": "mnop"}]
What changes should be made to get each JSON object in a single line?
You can write each entry to a file, one JSON object per line, then upload the file to S3:
import json

whole_file = [{"sample_column1": "abcd", "sample_column2": "efgh"},
              {"sample_column1": "ijkl", "sample_column2": "mnop"}
             ]

with open("temp.json", "w") as temp:
    for record in whole_file:
        temp.write(json.dumps(record, default=str))
        temp.write("\n")
The output should look like this:
~ cat temp.json
{"sample_column1": "abcd", "sample_column2": "efgh"}
{"sample_column1": "ijkl", "sample_column2": "mnop"}
Then upload the file:
import boto3

s3 = boto3.client("s3")
# boto3's upload_file signature is (Filename, Bucket, Key).
s3.upload_file("temp.json", bucket, "whole_file.json")
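Alternatively, to keep the original put_object call and skip the temporary file, the newline-delimited body can be built in memory; this is a sketch reusing whole_file and myconverter from the question:
import json

# One JSON object per line, as Athena expects.
body = "\n".join(json.dumps(record, default=myconverter) for record in whole_file)

s3.put_object(
    Bucket='sample-bucket',
    Key="sample.json",
    Body=body,
)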

Pig: parse bytearray as a string/json

I have some JSON data saved to S3 in SequenceFile format by Secor. I want to analyze it using Pig. Using elephant-bird I managed to load it from S3 in bytearray format, but I wasn't able to convert it to chararray, which is apparently needed to parse JSON:
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare BYTES_CONVERTER 'com.twitter.elephantbird.pig.util.BytesWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
grunt> A = LOAD 's3n://...logs/raw_logs/...events/dt=2015-12-08/1_0_00000000000085594299'
USING $SEQFILE_LOADER ('-c $LONG_CONVERTER', '-c $BYTES_CONVERTER')
AS (key: long, value: bytearray);
grunt> B = LIMIT A 1;
grunt> DUMP B;
(85653965,{"key": "val1", other json data, ...})
grunt> DESCRIBE B;
B: {key: long,value: bytearray}
grunt> C = FOREACH B GENERATE (key, (chararray)value);
grunt> DUMP C;
2015-12-08 19:32:09,133 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders.
Cannot determine how to convert the bytearray to string.
Using TextConverter instead of the BytesWritableConverter just leaves me with empty values, like:
(85653965,)
It's apparent that Pig was able to cast the byte array to a string in order to dump it, so it doesn't seem like it should be impossible. How do I do that?