Pig: Create json file with actual key_name and values - json

I have a pig script using elephant bird json loader.
data_input = LOAD '$DATA_INPUT' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []);
x = FOREACH data_input GENERATE json#'user__id_str', json#'user__created_at', json#'user__notifications', json#'user__follow_request_sent', json#'user__friends_count', json#'user__name', json#'user__time_zone', json#'user__profile_background_color', json#'user__is_translation_enabled', json#'user__profile_link_color', json#'user__utc_offset', json#'user__profile_sidebar_border_color', json#'user__has_extended_profile', json#'user__profile_background_tile', json#'user__is_translator', json#'user__profile_text_color', json#'user__location', json#'user__profile_banner_url', json#'user__profile_use_background_image', json#'user__default_profile_image', json#'user__description', json#'user__profile_background_image_url_https', json#'user__profile_sidebar_fill_color', json#'user__followers_count', json#'user__profile_image_url', json#'user__geo_enabled', json#'user__entities__description__urls', json#'user__screen_name', json#'user__favourites_count', json#'user__url', json#'user__statuses_count', json#'user__default_profile', json#'user__lang', json#'user__protected', json#'user__listed_count', json#'user__profile_image_url_https', json#'user__contributors_enabled', json#'user__following', json#'user__verified';
STORE x INTO '$DATA_OUTPUT' USING JsonStorage();
I have the output right but the field names are wrong.
My output has val_n instaed of the field names themselves:
{"val_0":"40510796","val_1":"Sat May 16 18:03:53 +0000 2009","val_2":"false"......}
I want something like:
{"user__id_str":"40510796","user__created_at":"Sat May 16 18:03:53 +0000 2009","user__notifications":"false"...........}
How can I get the column names as well?

Have you tried giving Alias in generate statement:
x = FOREACH data_input GENERATE json#'user__id_str' AS user__id_str, json#'user__created_at' AS user__created_at;

Related

Most efficient way to read concatenated json in PySpark?

I'm getting one json file where each line in the json is a json itself of 1000 objects, like this:
{"id":"test1", "results": [{"property1": "sample1"},{"property2": "sample2"}]}
{"id":"test2", "results": [{"property1": "sample3"},{"property2": "sample4"}]}
If I read it as a json using spark.read.json(filepath), I'm getting:
+-----+--------------------+
| id| results|
+-----+--------------------+
|test1|[{sample1, null},...|
+-----+--------------------+
(Which is only the first json in the concatenated json)
While I'm trying to get:
+-----+---------+---------+
|id |property1|property2|
+-----+---------+---------+
|test1|sample1 |sample1 |
|test2|sample3 |sample4 |
+-----+---------+---------+
I end up by reading the json as text, and iterate over each row to treat it as json and union each dataframe:
df = (spark.read.text(data[self.files]))
dataCollect = df.collect()
i = 0
for row in dataCollect:
df_row = flatten_json(spark.read.json(spark.sparkContext.parallelize(row)))
if i == 0:
df_all = df_row
else:
df_all = df_row.unionByName(df_all, allowMissingColumns = True)
i = i + 1
flatten_json is a helper that helps me to automatically flatten the json.
I guess there is a better approach, any help would be much appreciate
Your JSON file is called JSON Lines or JSONL which is a supported file format that Pyspark can handle natively. So, use the regular spark.read.json to read it and perform the additional transformations to match with what you want.
df = spark.read.json('yourfile.json or json/directory')
# Explode the array into structs. This will generate lots of nulls.
df = (df.select('id', F.explode('results').alias('results'))
.select('id', 'results.*'))
# Group them and aggregate to remove the nulls.
df = (df.groupby('id')
.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id']))
I think this works fine for 1000 lines JSONL, however, if you are curious about alternative solution that doesn't involve generating/removing nulls, please check here: By using PySpark how to parse nested json. In some situations, the alternative solution which doesn't do explode could be more performant.

Dimension problem when converting a MATLAB .m script into an Octave compatible syntax

I want to run a MATLAB script M-file to reconstruct a point cloud in Octave. Therefore I had to rewrite some parts of the code to make it compatible with Octave. Actually the M-file works fine in Octave (I don't get any errors) and also the plotted point cloud looks good at first glance, but it seems that the variables are only half the size of the original MATLAB variables. In the attached screenshots you can see what I mean.
Octave:
MATLAB:
You can see that the dimension of e.g. M in Octave is 1311114x3 but in MATLAB it is 2622227x3. The actual number of rows in my raw file is 2622227 as well.
Here you can see an extract of the raw file (original data) that I use.
Rotation angle Measured distance
-0,090 26,295
-0,342 26,294
-0,594 26,294
-0,846 26,295
-1,098 26,294
-1,368 26,296
-1,620 26,296
-1,872 26,296
In MATLAB I created my output variable as follows.
data = table;
data.Rotationangle = cell2mat(raw(:, 1));
data.Measureddistance = cell2mat(raw(:, 2));
As there is no table function in Octave I wrote
data = cellfun(#(x)str2num(x), strrep(raw, ',', '.'))
instead.
Octave also has no struct2array function, so I had to replace it as well.
In MATLAB I wrote.
data = table2array(data);
In Octave this was a bit more difficult to do. I had to create a struct2array function, which I did by means of this bug report.
%% Create a struct2array function
function retval = struct2array (input_struct)
%input check
if (~isstruct (input_struct) || (nargin ~= 1))
print_usage;
endif
%convert to cell array and flatten/concatenate output.
retval = [ (struct2cell (input_struct)){:}];
endfunction
clear b;
b.a = data;
data = struct2array(b);
Did I make a mistake somewhere and could someone help me to solve this problem?
edit:
Here's the part of my script where I'm using raw.
delimiter = '\t';
startRow = 5;
formatSpec = '%s%s%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
fclose(fileID);
%% Convert the contents of columns containing numeric text to numbers.
% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,2]
% Converts text in the input cell array to numbers. Replaced non-numeric
% text with NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1)
% Create a regular expression to detect and remove non-numeric prefixes and
% suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\.]*)+[\,]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\.]*)*[\,]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData(row), regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if numbers.contains('.')
thousandsRegExp = '^\d+?(\.\d{3})*\,{0,1}\d*$';
if isempty(regexp(numbers, thousandsRegExp, 'once'))
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric text to numbers.
if ~invalidThousandsSeparator
numbers = strrep(numbers, '.', '');
numbers = strrep(numbers, ',', '.');
numbers = textscan(char(numbers), '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch
raw{row, col} = rawData{row};
end
end
end
You don't see any raw in my workspaces because I clear all temporary variables before I reconstruct my point cloud.
Also my original data in row 1311114 and 1311115 look normal.
edit 2:
As suggested here is a small example table to clarify what I want and what MATLAB does with the table2array function in my case.
data =
-0.0900 26.2950
-0.3420 26.2940
-0.5940 26.2940
-0.8460 26.2950
-1.0980 26.2940
-1.3680 26.2960
-1.6200 26.2960
-1.8720 26.2960
With the struct2array function I used in Octave I get the following array.
data =
-0.090000 26.295000
-0.594000 26.294000
-1.098000 26.294000
-1.620000 26.296000
-2.124000 26.295000
-2.646000 26.293000
-3.150000 26.294000
-3.654000 26.294000
If you compare the Octave array with my original data, you can see that every second row is skipped. This seems to be the reason for 1311114 instead of 2622227 rows.
edit 3:
I tried to solve my problem with the suggestions of #Tasos Papastylianou, which unfortunately was not successful.
First I did the variant with a struct.
data = struct();
data.Rotationangle = [raw(:,1)];
data.Measureddistance = [raw(:,2)];
data = cell2mat( struct2cell (data ).' )
But this leads to the following structure in my script. (Unfortunately the result is not what I would like to have as shown in edit 2. Don't be surprised, I only used a small part of my raw file to accelerate the run of my script, so here are only 769 lines.)
[766,1] = -357,966
[767,1] = -358,506
[768,1] = -359,010
[769,1] = -359,514
[1,2] = 26,295
[2,2] = 26,294
[3,2] = 26,294
[4,2] = 26,296
Furthermore I get the following error.
error: unary operator '-' not implemented for 'cell' operands
error: called from
Cloud_reconstruction at line 137 column 11
Also the approach with the dataframe octave package didn't work. When I run the following code it leads to the error you can see below.
dataframe2array = #(df) cell2mat( struct(df).x_data );
pkg load dataframe;
data = dataframe();
data.Rotationangle = [raw(:, 1)];
data.Measureddistance = [raw(:, 2)];
dataframe2array(data)
error:
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 147 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 106 column 20
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 176 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 106 column 20
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 147 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 107 column 23
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 176 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 107 column 23
error: RHS(_,2): but RHS has size 768x1
error: called from
df_matassign at line 179 column 11
subsasgn at line 172 column 14
Cloud_reconstruction at line 107 column 23
Both error messages refer to the following part of my script where I'm doing the reconstruction of the point cloud in cylindrical coordinates.
distLaserCenter = 47; % Distance between the pipe centerline and the blind zone in mm
m = size(data,1); % Find the length of the first dimension of data
zincr = 0.4/360; % z increment in mm per deg
data(:,1) = -data(:,1);
for i = 1:m
data(i,2) = data(i,2) + distLaserCenter;
if i == 1
data(i,3) = 0;
elseif abs(data(i,1)-data(i-1)) < 100
data(i,3) = data(i-1,3) + zincr*(data(i,1)-data(i-1));
else abs(data(i,1)-data(i-1)) > 100;
data(i,3) = data(i-1,3) + zincr*(data(i,1)-(data(i-1)-360));
end
end
To give some background information for a better understanding. The script is used to reconstruct a pipe as a point cloud. The surface of the pipe was scanned from inside with a laser and the laser measured several points (distance from laser to the inner wall of the pipe) at each deg of rotation. I hope this helps to understand what I want to do with my script.
Not sure exactly what you're trying to do, but here's a toy example of how a struct could be used in an equivalent manner to a table:
matlab:
data = table;
data.A = [1;2;3;4;5];
data.B = [10;20;30;40;50];
table2array(data)
octave:
data = struct();
data.A = [1;2;3;4;5];
data.B = [10;20;30;40;50];
cell2mat( struct2cell (data ).' )
Note the transposition operation (.') before passing the result to cell2mat, since in a table, the 'fieldnames' are arranged horizontally in columns, whereas the struct2cell ends up arranging what used to be the 'fieldnames' as rows.
You might also be interested in the dataframe octave package, which performs similar functions to matlab's table (or in fact, R's dataframe object): https://octave.sourceforge.io/dataframe/ (you can install this by typing pkg install -forge dataframe in your console)
Unfortunately, the way to display the data as an array is still not ideal (see: https://stackoverflow.com/a/55417141/4183191), but you can easily convert that into a tiny function, e.g.
dataframe2array = #(df) cell2mat( struct(df).x_data );
Your code can then become:
pkg load dataframe;
data = dataframe();
data.A = [1;2;3;4;5];
data.B = [10;20;30;40;50];
dataframe2array(data)

JQ 1.5: Timestamp / date transformation produce huge file

I using jq 1.5 under a Windows 10 powershell enviroment to transform json files and import them to a MS SQL database. The original json file is around 1,1mb. I stored the file here: Json origin file. I use following jq command to transform the data:
[.legs[] | {Legid: .legId, Farecode: .fareBasisCode, Travelduration: .travelDuration, Traveldistance: .totalTravelDistance, Distanceunit: .totalTravelDistanceUnits, Refundable: .isRefundable , Nonstop: .isNonStop, Departure_Airport: .segments[].departureAirportName, Departure_Code: .segments[].departureAirportCode, Arrival_Airport: .segments[].arrivalAirportName, Arrival_Code: .segments[].arrivalAirportCode, Departure_Time: .segments[].departureTimeEpochSeconds, Arrival_Time: .segments[].arrivalTimeEpochSeconds, Airline: .segments[].airlineName, Airline_Code: .segments[].airlineCode, Flight_Number: .segments[].flightNumber, Equipment: .segments[].equipmentDescription}]
That command produce following file transformed file. Now i had to transform the UNIX Timestamps to Dates. So i modified the command:
[.legs[] | {Legid: .legId, Farecode: .fareBasisCode, Travelduration: .travelDuration, Traveldistance: .totalTravelDistance, Distanceunit: .totalTravelDistanceUnits, Refundable: .isRefundable , Nonstop: .isNonStop, Departure_Airport: .segments[].departureAirportName, Departure_Code: .segments[].departureAirportCode, Arrival_Airport: .segments[].arrivalAirportName, Arrival_Code: .segments[].arrivalAirportCode, Departure_Time: .segments[].departureTimeEpochSeconds, Arrival_Time: .segments[].arrivalTimeEpochSeconds, Airline: .segments[].airlineName, Airline_Code: .segments[].airlineCode, Flight_Number: .segments[].flightNumber, Equipment: .segments[].equipmentDescription}] | .[].Departure_Time |= todate | .[].Arrival_Time |= todate
The transformed file without date Transformation have around 3 mb. After the date Transformation the file have around 40 mb. I think i have a logical error in my command, but cant find it. Tips?
Regards
Timo
Your use of iteration (.segments[]) causes multiplicative behavior: in your case, since there are four cases in which .segments|length is 2, you get a 2^10 expansion locally, four times.
In situations such as this, it would make sense to use a small but well-chosen subset of the data (or maybe more easily, an artificial dataset) to check the code.
Perhaps what you intended is something more like:
[ .legs[] | range(0; .segments|length) as $i | .... ]

Unable to Read JSON file using Elephant Bird

Trying to load the json file which is having null values in it by using elephant-bird JsonLoader.
sample.json
{"created_at": "Mon Aug 22 10:48:23 +0000 2016","id": 767674772662607873,"id_str": "767674772662607873","text": "KPIT Image Result for https:\/\/t.co\/Nas2ZnF1zZ... https:\/\/t.co\/9TnelwtIvm","source": "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated": false,"in_reply_to_status_id": 123,"in_reply_to_status_id_str": null,"in_reply_to_user_id": null,"in_reply_to_user_id_str": null,"in_reply_to_screen_name": null,"geo": null,"coordinates": null,"place": null,"contributors": null,"is_quote_status": false,"retweet_count": 0,"favorite_count": 0,"entities": {"hashtags": [],"urls": [{"url": "https:\/\/t.co\/Nas2ZnF1zZ","expanded_url": "http:\/\/miltonious.com\/","display_url": "miltonious.com","indices": [24, 47]}],"user_mentions": [],"symbols": []},"favorited": false,"retweeted": false,"possibly_sensitive": false,"filter_level": "low","lang": "en","timestamp_ms": "1471862903167"}
script:
REGISTER piggybank.jar
REGISTER json-simple-1.1.1.jar
REGISTER elephant-bird-pig-4.3.jar
REGISTER elephant-bird-core-4.1.jar
REGISTER elephant-bird-hadoop-compat-4.3.jar
json = LOAD 'sample.json' USING JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');
describe json;
dump json;
When I dump json,I am getting the following output and the worning
(Mon Aug 22 10:48:23 +0000 2016,767674772662607873,767674772662607873,google Image Result for Twitter Web Client,false,1234,12345,3214,43215,,,,,,,,,,,,,,)
WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad record, returning null for {complete json}
By warning i guess it is getting NULL values.
So how can we load a Json which is having null values in it.
And I have tried in another way i.e
json = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');
describe json;
Output
Schema for json unknown.
Please suggest me.
Thanks.
You can try something like this,
MY_JSON = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
dump MY_JSON;

Json Parsing in Apache Pig

I am Having a json :
{"Name":"sampling","elementInfo":{"fraction":"3"},"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}
I found that we are able to load json into PigScript.
A = LOAD ‘data.json’
USING PigJsonLoader();
But how to parse json in Apache Pig
--Sampling.pig
--pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'
--Load data
inputdata = LOAD '$input' using PigStorage('$delimiter');
--Group data
groupedByAll = group inputdata all;
--output into hdfs
sampled = SAMPLE inputdata $fraction;
store sampled into '$output' using PigStorage('$delimiter');
Above is my pig script.
How to parse json (each element) in Apache pig?
I need to take above json as input and parse its source,delimiter,fraction,output and pass in $input,$delimiter,$fraction,$output respectively.
How to parse the same .
Please suggest
Try this :
--Load data
inputdata = LOAD '/input.txt' using JsonLoader('Name:chararray,elementinfo:(fraction:chararray),destionation:chararray,source:chararray');
--Group data
groupedByAll = group inputdata all;
store groupedByAll into '/OUT/pig' using PigStorage(',');
Now your output looks :
all,{(sampling1,(4),/user/sree/OUT1,/user/sree/foo1.txt),(sampling,(3),/user/sree/OUT,/user/sree/foo.txt)}
In input file fraction data {"fraction":"3"} in double quotes. so i used fraction as chararray so can't able to run sample command so i used the above script to get the result.
if you want to perform sample operation cast the fraction data to int and then you will get the result.