I want to load JSON or CSV data into HBase without using any MapReduce programs or HiveQL/Pig support. Is that possible, and which one is more efficient: Hive-HBase or MapReduce-HBase?
I use a Perl script to do this.
This is my (Perl-generated) JSON file:
{"OMAN20140131":{"c3":"c","c4":"d","c5":"tim","c2":"b","c6":"andrew","c1":"a"},"CURRENTLY20140131":{"c2":"tim2","c1":"bill2"},"THERE20140131":{"c3":"c","c4":"d","c9":"bill2","c10":"tim2","c2":"b","c6":"andrew","c7":"bill","c5":"tim","c1":"a","c8":"tom"},"TODAY20140131":{"c2":"bill","c1":"tom"}}
I am sharding on a string key, with multiple columns depending on who/what has referenced the key object.
use strict;
use warnings;
use Data::Dumper;
use JSON::XS qw(encode_json decode_json);
use File::Slurp qw(read_file write_file);

my %words = ();
my $debug = 0;

# Slurp the whole JSON file and decode it into %words
sub ReadHash {
    my ($filename) = @_;
    my $json = read_file( $filename, { binmode => ':raw' } );
    %words = %{ decode_json $json };
}
# Main starts here
ReadHash("Save.json");

# Emit one HBase shell 'put' line per row key
foreach my $key (keys %words) {
    printf("put 'test', '$key',");
    my $cnt = 0;
    foreach my $key2 ( keys %{ $words{$key} } ) {
        my $val = $words{$key}{$key2};
        print "," if $cnt > 0;
        printf("'cf:$key2', '$val'");
        ++$cnt;
    }
    print "\n";
}
Generate the HBase commands and then execute them. Alternatively, I would look at happybase (Python), which also loads large data sets very quickly. Hope this helps.
This should produce output like:
put 'test', 'WHERE20140131','cf:c2', 'bill2','cf:c1', 'tim2'
put 'test', 'OMAN20140131','cf:c3', 'c','cf:c4', 'd','cf:c5', 'tim','cf:c2', 'b','cf:c1', 'a','cf:c6', 'andrew'
put 'test', 'CURRENTLY20140131','cf:c2', 'tim2','cf:c1', 'bill2'
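To execute the generated commands, you can pipe the script's output straight into the HBase shell, e.g. perl json2hbase.pl | hbase shell (where json2hbase.pl is a hypothetical name for the script above).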
Maybe you can also refer to bulk loading; see the HBase documentation on bulk loads.
I am getting a "csv file" from a vendor (using their API), but what they do is just spew the whole thing into their response. It wouldn't be a significant problem except that, of course, some of those pesky humans entered the data and put in "features" like line breaks. What I am doing now is creating a file for the raw data and then reopening it to read the data:
open RAW, ">", "$rawfile" or die "ERROR: Could not open $rawfile for write: $! \n";
print RAW $response->content;
close RAW;
my $csv = Text::CSV_XS->new({ binary=>1,always_quote=>1,eol=>$/ });
open my $fh, "<", "$rawfile" or die "ERROR: Could not open $rawfile for read: $! \n";
while ( my $line = $csv->getline($fh) ) { ...
Somehow this seems ... inelegant. It seems that I ought to be able to just read the data from $response->content (a multiline string) as if it were a file, but I'm drawing a total blank on how to do this.
A pointer would be greatly appreciated.
Thanks,
Paul
You could use a string filehandle:
use Carp;

my $data = $response->content;
open my $fh, "<", \$data or croak "unable to open string filehandle: $!";
my $csv = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => $/ });
while ( my $line = $csv->getline($fh) ) { ... }
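A small variation, in case the vendor sends UTF-8: in-memory filehandles accept PerlIO layers just like real files, so you can decode while reading (a sketch, assuming the payload really is UTF-8):

# use the raw bytes from ->content here, not ->decoded_content,
# because the :encoding layer does the decoding itself
my $raw = $response->content;
open my $fh, '<:encoding(UTF-8)', \$raw
    or croak "unable to open string filehandle: $!";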
Yes, you can use Text::CSV_XS on a string, via its functional interface:
use warnings;
use strict;
use feature 'say';
use Text::CSV_XS qw(csv); # must use _XS version
my $csv = qq(a,line\nand,another);
my $aoa = csv(in => \$csv)
    or die Text::CSV_XS->error_diag;
say "@$_" for @$aoa;
Note that this indeed needs Text::CSV_XS (normally Text::CSV works but not with this).
I don't know why this isn't available in the OO interface (or perhaps it is, but it is not documented).
While the above parses the string directly, as asked, you can also lessen the "inelegant" aspect of your example by writing the content directly to a file as it is acquired, which most libraries support, for example via the :content_file option of LWP::UserAgent's get method.
Note also that most of the time you want the library to decode the content for you, so with LWP::UserAgent use decoded_content (see HTTP::Response).
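A minimal sketch of that approach (the URL and filename here are placeholders):

use strict;
use warnings;
use LWP::UserAgent;
use Text::CSV_XS;

my $ua  = LWP::UserAgent->new;
my $url = 'https://example.com/export.csv';   # hypothetical endpoint

# stream the body straight to disk instead of holding it all in memory
my $response = $ua->get( $url, ':content_file' => 'raw.csv' );
die 'GET failed: ', $response->status_line, "\n" unless $response->is_success;

my $csv = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => $/ });
open my $fh, '<', 'raw.csv' or die "Could not open raw.csv: $!";
while ( my $row = $csv->getline($fh) ) {
    # process $row (an arrayref of fields) here
}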
I cooked up this example with Mojo::UserAgent. For the CSV input I used various data sets from the NYC Open Data. This is also going to appear in the next update for Mojo Web Clients.
I build the request without making the request right away, and that gives me the transaction object, $tx. I can then replace the read event so I can immediately send the lines into Text::CSV_XS:
#!perl
use v5.10;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );

$tx->res->content->unsubscribe('read')->on(read => sub {
    state $csv = do {
        require Text::CSV_XS;
        Text::CSV_XS->new;
    };
    state $buffer;
    state $reader = do {
        open my $r, '<:encoding(UTF-8)', \$buffer;
        $r;
    };
    my ($content, $bytes) = @_;
    $buffer .= $bytes;
    while (my $row = $csv->getline($reader) ) {
        say join ':', $row->@[2,4];
    }
});

$tx = $ua->start($tx);
That's not as nice as I'd like it to be because all the data still show up in the buffer. This is slightly more appealing, but it's fragile in the ways I note in the comments. I'm too lazy at the moment to make it any better because that gets hairy very quickly as you figure out when you have enough data to process a record. My particular code isn't as important as the idea that you can do whatever you like as the transactor reads data and passes it into the content handler:
use v5.10;
use strict;
use warnings;
use feature qw(signatures);
no warnings qw(experimental::signatures);

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $url = ...;

my $tx = $ua->build_tx( GET => $url );
$tx->res->content
    ->unsubscribe('read')
    ->on( read => process_bytes_factory() );
$tx = $ua->start($tx);

sub process_bytes_factory {
    return sub ( $content, $bytes ) {
        state $csv = do {
            require Text::CSV_XS;
            Text::CSV_XS->new( { decode_utf8 => 1 } );
        };
        state $buffer = '';
        state $line_no = 0;

        $buffer .= $bytes;

        # fragile if the entire content does not end in a
        # newline (or whatever the line ending is)
        my $last_line_incomplete = $buffer !~ /\n\z/;

        # will not work if the format allows embedded newlines
        my @lines = split /\n/, $buffer;
        $buffer = pop @lines if $last_line_incomplete;

        foreach my $line ( @lines ) {
            my $status = $csv->parse($line);
            my @row = $csv->fields;
            say join ':', $line_no++, @row[2,4];
        }
    };
}
My program logs in to a list of IPs, identifies the software running on each machine, and prints the output.
I want the output to be in the form of a JSON array.
Can a JSON encode function be used for this?
use strict;
use warnings;
use QA::unit::testbedinfo;

my @machines_under_test = ( ... ); # list of ip's listed here

sub test_1_get_install_info_of_machines_under_test {
    my ( $self ) = @_;
    my %output;
    foreach my $ip ( @machines_under_test ) {
        my $output = $self->{'queryObj'}->get_install_info( $ip );
        push @{ $output{$output} }, $ip;
        INFO( ' software version running on machine ' . $ip . ' : ' . $output );
    }
    return 1;
}
So it looks like you're building a hash, %output, which has software version numbers for keys and (references to) arrays of IP addresses for values, correct? To output that structure as JSON, just use the JSON module and print the output of the to_json function:
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
use JSON 'to_json';
my %output = (
    '1.0' => [ qw( 1.2.3.4 5.6.7.8 ) ],
    '1.1' => [ qw( 192.168.0.3 192.168.37.42 192.168.0.123 ) ],
);
# Note that to_json takes a reference to the structure, not the raw hash
say to_json(\%output);
Which produces the output:
{"1.0":["1.2.3.4","5.6.7.8"],"1.1":["192.168.0.3","192.168.37.42","192.168.0.123"]}
Here is my code: I open the file and decode it as UTF-8, then read each line, append it to the variable $abstract_text, and send the result back in a JSON structure.
my $fh;
if (!open($fh, '<:encoding(UTF-8)', $path))
{
    returnApplicationError("Cannot read abstract file: $path ($!)\nERRORCODE|111|\n");
}
printJsonHeader;

my @lines = <$fh>;
my $abstract_text = '';
foreach my $line (@lines)
{
    $abstract_text .= $line;
}
my $json = encode_json($abstract_text);
close $fh;
print $json;
By using that code, I get this error:
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this)
The error message also points out that the problem is in this line:
my $json = encode_json($abstract_text);
I want to send the data back as a string (which is in UTF-8). Please help.
I assume you're using either JSON or JSON::XS.
Both allow for non-reference data, but not via the procedural encode_json routine.
You'll need to use the object-oriented approach:
use strict; # obligatory
use warnings; # obligatory
use JSON::XS;
my $encoder = JSON::XS->new();
$encoder->allow_nonref();
print $encoder->encode('Hello, world.');
# => "Hello, world."
Using this code, I am only getting a flat array with the name and extension values listed one after the other. How do I get a separate {name:"tom", extension:"012345"} object for each hash I am creating? That is, I would like [{name:"tom", extension:"012345"}, {name:"tim", extension:"66666"}].
#!/usr/bin/perl
use strict;
use warnings;
use lib $ENV{HOME} . '/perl/lib/perl5';
use lib '/home/accounts/lib';
use Data::Dump qw(dump);
use Lib::Phonebook;
use JSON;

my $ldap = Lib::Phonebook->new();
my (@names, @numbers, $count, $name_number_count, @result);

@names = $ldap->list_telephone_account_names();
@numbers = $ldap->list_telephone_account_numbers();
$name_number_count = @names;
$count = 0;

for $count (0 .. $name_number_count - 1) {
    my %hash = ( "name" => $names[$count], "extension" => $numbers[$count] );
    push( @result, %hash );
}

my $json = encode_json \@result;
print dump $json;
Instead of
push( @result, %hash );
do:
push( @result, \%hash );
The original statement copies the entire hash into the list (as you discovered, it flattens to key, value, key, value, ...), whereas adding the backslash pushes a reference to the hash (without copying anything else), and serializing that to JSON produces the nested structure you want.
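A minimal, standalone sketch of the difference:

use strict;
use warnings;
use JSON;

my %hash = ( name => 'tom', extension => '012345' );

my @flat;
push @flat, %hash;                  # copies the key/value pairs into the list
print encode_json(\@flat), "\n";    # ["name","tom","extension","012345"] (pair order may vary)

my @nested;
push @nested, \%hash;               # stores a single hash reference
print encode_json(\@nested), "\n";  # [{"name":"tom","extension":"012345"}]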
I'm trying to edit an old perl script and I'm a complete beginner. The request from the server returns as:
$VAR1 = [
{
'keywords' => [
'bare knuckle boxing',
'support group',
'dual identity',
'nihilism',
'support',
'rage and hate',
'insomnia',
'boxing',
'underground fighting'
],
}
];
How can I parse this JSON string to grab:
$keywords = "bare knuckle boxing,support group,dual identity,nihilism,support,rage and hate,insomnia,boxing,underground fighting"
Full Perl code:
#!/usr/bin/perl
use LWP::Simple; # From CPAN
use JSON qw( decode_json ); # From CPAN
use Data::Dumper; # Perl core module
use strict; # Good practice
use warnings; # Good practice
use WWW::TheMovieDB::Search;
use utf8::all;
use Encode;
use JSON::Parse 'json_to_perl';
use JSON::Any;
use JSON;
my $api = new WWW::TheMovieDB::Search('APIKEY');
my $img = $api->type('json');
$img = $api->Movie_imdbLookup('tt0137523');
my $decoded_json = decode_json( encode("utf8", $img) );
print Dumper $decoded_json;
Thanks.
Based on comments and on your recent edit, I would say that what you are asking is how to navigate a perl data structure, contained in the variable $decoded_json.
my $keywords = join ",", @{ $decoded_json->[0]{'keywords'} };
say qq{ @{ $arrayref->[0]->{'keywords'} } };
As TLP pointed out, all you've shown is a combination of Perl arrays/hashes. But you should look at the JSON.pm documentation if you have an actual JSON string.
The result you present is similar to JSON, but it is the Perl variant of it (i.e. => instead of :, etc.). I don't think you need to look into the JSON part of it, as you already have the data. You just need to use Perl to join the data into a text string.
Just to elaborate on the solution by vol7ron:
# get a reference to the list of keywords
my $keywords_list = $decoded_json->[0]{'keywords'};

# merge this list with commas
my $keywords = join(',', @$keywords_list);
print $keywords;
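With the structure shown above, that prints: bare knuckle boxing,support group,dual identity,nihilism,support,rage and hate,insomnia,boxing,underground fighting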