Defining a schema in JsonLoader in Pig (JSON)

I am trying to specify the schema of a dataset while loading a JSON file in Pig with JsonLoader.
The format of the data is:
{
'cat_a':'some_text',
'cat_b':{(attribute_name):(attribute_value)}
}
I am trying to describe the schema as:
LOAD 'filename' USING JsonLoader('cat_a:chararray, cat_b:(attribute_name:chararray,attribute_value:int)');
I suspect I'm describing the schema for cat_b incorrectly.
Can someone help out with that?
Thanks in advance.

If your JSON is of the format
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}]}
store the above JSON in test.json and run the command below:
a = LOAD '/home/abhijit/Desktop/test.json' USING JsonLoader('recipe:chararray,ingredients: {(name:chararray)}');
dump a;
you will get output like:
(Tacos,{(Beef),(Lettuce),(Cheese)},)
If your JSON is in the format below:
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}
a = LOAD '/home/abhijit/Desktop/test.json' USING JsonLoader('recipe:chararray,ingredients: {(name:chararray)},inventor: (name:chararray, age:int)');
dump a;
the output would be:
(Tacos,{(Beef),(Lettuce),(Cheese)},(Alex,25))
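Mapping this back to the question: cat_b with fixed field names matches the inventor tuple in the second example, so a schema along the lines of 'cat_a:chararray, cat_b:(attribute_name:chararray,attribute_value:int)' should load once the data itself is valid JSON (double quotes, explicit field names). If the keys inside cat_b vary from record to record, a Pig map such as cat_b:map[] may be the better fit. (Untested against the exact data above.)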

Related

Error when importing GeoJson into BigQuery

I'm trying to load GeoJson data [1] into BigQuery via Cloud Shell but I'm getting the following error:
Failed to parse JSON: Top-level GeoJson 'type' member should have value 'Feature', but was 'FeatureCollection'.; ParsedString returned false; Could not parse value; Parser terminated before end of string
It feels like the GeoJson file is not formatted properly for BigQuery, but I have no idea whether that's true or how to fix it.
[1] https://github.com/tonywr71/GeoJson-Data/blob/master/australian-suburbs.geojson
Expanding on @scespinoza's answer, I was able to convert to newline-delimited GeoJSON and load it into BigQuery with the following steps:
geojson2ndjson geodata.txt > geodata_converted.txt
Using this command, I encountered an error (screenshot omitted), but I was able to work around it by splitting the data into 2 tables and applying the same command to each. The table then loaded successfully in BigQuery.
Your file is in standard GeoJSON format, but BigQuery only accepts newline-delimited GeoJSON files, i.e. one individual GeoJSON object per line (see the documentation: https://cloud.google.com/bigquery/docs/geospatial-data#geojson-files). So you should first convert the dataset to the appropriate format. Here is a good, simple explanation of how it works: https://stevage.github.io/ndgeojson/.
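If you prefer to do the conversion in code instead of with geojson2ndjson, here is a minimal Python sketch that unpacks a FeatureCollection into one Feature per line (the file names are made-up examples):
# convert standard GeoJSON (one FeatureCollection) into newline-delimited
# GeoJSON (one Feature object per line), the layout BigQuery expects
import json

with open("australian-suburbs.geojson") as src:      # hypothetical input file
    collection = json.load(src)

with open("australian-suburbs.ndjson", "w") as dst:  # hypothetical output file
    for feature in collection["features"]:
        dst.write(json.dumps(feature) + "\n")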

Apache NiFi: How to create a parquet file from a CSV file with the schema saved in the "avro.schema" attribute

I am trying to create a parquet file from a CSV file using Apache NiFi.
I am able to convert the CSV to a parquet file, but the problem is that the schema of the parquet file contains a struct type, which I need to convert into a string type.
I am using Apache Nifi 1.14.0 on Windows Server 2016.
This is what I've tried so far to convert CSV to parquet.
I have used the below 3 controller services:
CSVReader
CSVRecordSetWriter
ParquetRecordSetWriter
And these are the processors in the flow:
GetFile
ConvertRecord (CSVReader to CSVRecordSetWriter; this automatically generates the "avro.schema" attribute, which I update in the next step)
UpdateAttribute (updating the "avro.schema" attribute: wherever 2 data types were inferred for a field, I replace the type with '["null","string"]'; see the sketch after this list)
ConvertRecord(CSVReader to ParquetRecordSetWriter)
UpdateAttribute (for appending '.parquet' to the filename)
PutFile
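For clarity, here is a standalone Python sketch of the avro.schema rewrite from step 3 (NiFi itself does this inside UpdateAttribute; the record and field names below are made-up examples):
# wherever the CSV reader inferred 2 types for a field, force the field
# to ["null","string"] so parquet ends up with plain strings, not structs
import json

avro_schema = json.loads(
    '{"type":"record","name":"row","fields":['
    '{"name":"col_a","type":["null","string"]},'
    '{"name":"col_b","type":["int","string"]}]}'  # 2 inferred types
)

for field in avro_schema["fields"]:
    if isinstance(field["type"], list) and len(field["type"]) == 2:
        field["type"] = ["null", "string"]

print(json.dumps(avro_schema))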
I also want to know how to view a .parquet file on Windows; currently I am reading the parquet file via PySpark just to check the schema. :|
This is how the parquet file schema looks after conversion (screenshot omitted); I want string instead of struct in the output.
Please note: there are lots of CSVs with many columns/fields, so I don't want to create the schemas manually.
OR
Any other way to achieve this would be very helpful.
Thanks!
After playing around with some more options of "ParquetRecordSetWriter", I was able to create a parquet file with the schema that I had captured in the "avro.schema" attribute.
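As for viewing a .parquet file on Windows without Spark: a quick Python sketch using pandas with pyarrow (pip install pandas pyarrow; the file name is a made-up example):
# inspect a parquet file's schema and contents without Spark
import pandas as pd

df = pd.read_parquet("output.parquet")  # hypothetical file name
print(df.dtypes)  # the column types, i.e. the schema
print(df.head())  # the first few rows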

Writing a Spark dataframe to ASCII JSON

I am attempting to write a Spark dataframe as a JSON file; this will eventually be written out into a MapR JSON DB table.
grp_small.toJSON.write.save("<path>")
This seems to write the output as snappy.parquet files. How do I force it to write readable JSON (text format)?
You can write the dataframe to JSON so that each row comes out as readable JSON on its own line:
grp_small.write.json("path to output")
Hope this helps!
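If you are on PySpark rather than Scala, a minimal sketch of the same idea (the session setup and dataframe are placeholders):
# json() writes one plain-text JSON object per line (newline-delimited JSON),
# unlike save(), which defaults to snappy-compressed parquet
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-out").getOrCreate()
grp_small = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
grp_small.write.mode("overwrite").json("/tmp/grp_small_json")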

How to save JSON data into CSV using Python

I have JSON data like this (shown in a screenshot, omitted here).
I want to save the JSON into a CSV where each title becomes a column holding the information under that title.
I hope this gets converted to a comment, but look at pandas; it can probably do what you want (Pandas json to csv).
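A minimal pandas sketch, with an assumed input structure since the question only shows the data in a screenshot:
# flatten a list of JSON records into a CSV; json_normalize expands
# nested objects into dotted column names
import pandas as pd

data = [
    {"title": "first", "info": {"views": 10, "likes": 2}},
    {"title": "second", "info": {"views": 25, "likes": 7}},
]

df = pd.json_normalize(data)
df.to_csv("output.csv", index=False)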

How to read non-delimited JSON using Pig?

I have a JSON file whose raw text looks like this:
{a:1,b:2,c:3}{a:3,b:3,c:5}{a:3,b:3,c:9}
Doing
raw = LOAD 'jsonfile.text' USING JsonLoader('a:chararray,b:chararray,c:chararray');
dump raw;
only returns 1 record.
Actual excerpt from log:
Input(s): Successfully read 1 records (630644858 bytes) from:
"s3n://logstash/ls.s3.ip-10-45-56-56.2016-03-02T23.10.part42.txt"
Output(s): Successfully stored 1 records (1900 bytes) in:
"hdfs://nameservice1/tmp/temp-1489272670/tmp-1959659634"
It looks like only the first record of the JSON is being read; the JSON file is not newline-delimited.
Anyone have any tips?
I would suggest doing a first pass that does the string replacement }{ -> }\n{. Then you will have one valid JSON object per line, and the JSON parsing should work.
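A minimal Python sketch of that first pass (file names are made-up; this assumes }{ occurs only between records, not inside string values or nested objects):
# rewrite back-to-back JSON objects into one object per line before
# handing the file to Pig's JsonLoader
with open("jsonfile.text") as src, open("jsonfile_ndjson.text", "w") as dst:
    for chunk in src:
        dst.write(chunk.replace("}{", "}\n{"))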
Check out Twitter's Elephant Bird jar, which can be used to work with practically any kind of JSON data.
For reference, here is a sample Pig script working on JSON data similar to yours:
https://gist.github.com/neilkod/2898455
Hope this helps!! <><