I'm trying to convert this data (https://rest.kegg.jp/get/br:ko00001/json) in JSON to CSV/TSV. I've been able to do so in awk and sed but I'm learning Perl for bigger projects so it'd be helpful to learn to do it without the JSON module.
sed -E 's/^\t{2}"name"/\t\t"level 1"/g;s/^\t{3}"name"/\t\t\t"level 2"/g;s/^\t{4}"name"/\t\t\t\t"level 3"/g;s/^\t{5}"name"/\t\t\t\t\t"level 4"/g' json.json | awk 'BEGIN {OFS="\t"} NR > 4 {match($0, /"([^"]+)": *("[^"]*")/, a)} {tag = a[1]; val = gensub(/^"|"$/, "", "g", a[2]); f[tag] = val; if (tag == "level 4") {print f["level 1"], f["level 2"], f["level 3"], f["level 4"]}}' > table.tsv
Above is how I made it by awk and sed. json.json is downloaded from the link.
Here is what I've been trying so far in Perl without the JSON module. I'd like to do it this way to learn about the data structure and how Perl works.
use strict;
my $brite_hierarchy_filepath = shift #ARGV;
open my $brite_hierarchy, '<:utf8', $brite_hierarchy_filepath or die q{Can't open $brite_hierarchy_filepath: $!\n};
while (my $line = <$brite_hierarchy>) {
next if $. == 4;
chomp $line;
$line =~ s/\A\t{2}"name"/"level_1"/;
$line =~ s/\A\t{3}"name"/"level_2"/;
$line =~ s/\A\t{4}"name"/"level_3"/;
$line =~ s/\A\t{5}"name"/"level_4"/;
my ($tag) = $line =~ /\A"(.*?)"/;
my ($value) = $line =~ /\A"level_[1-4]":"(.*?)"/;
my %field = ($tag => $value) unless $tag eq "" && $value eq "";
for (keys %field) {
print join("\t", $field{"level_1"}, $field{"level_2"}, $field{"level_3"}, $field{"level_4"}, "\n");
};
last if eof $brite_hierarchy;
};
This is how the data looks like briefly.
{
"name":"ko00001",
"children":[
{
"name":"09100 Metabolism",
"children":[
{
"name":"09101 Carbohydrate metabolism",
"children":[
{
"name":"00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]",
"children":[
{
"name":"K00844 HK; hexokinase [EC:2.7.1.1]"
},
{
"name":"K12407 GCK; glucokinase [EC:2.7.1.2]"
},
{
"name":"K00845 glk; glucokinase [EC:2.7.1.2]"
...
And the desired output in TSV format.
09100 Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] K00844 HK; hexokinase [EC:2.7.1.1]
09100 Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] K12407 GCK; glucokinase [EC:2.7.1.2]
09100 Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] K00845 glk; glucokinase [EC:2.7.1.2]
I would always suggest a JSON parser, but you can indeed treat this as just a fixed text file if you can guarantee that format never changes. In production, you usually can't. But if it's a one-off, then it certainly works.
The example input you pasted into the question has spaces, not tabs, so your code would not work on it. Neither would mine. My input is copied from your link, and has tabs.
Your regex patterns seem a bit complicated. You can always have the same trivial pattern, but just need to vary the number of tabs before each name. The trick is to skip to the next line whenever you find a name that is not the final column, and to reset the entire structure on the first column. I chose to use an array rather than a hash, as that makes more sense and we can just join later when we output. Finally, say is like print but with a built-in newline.
use strict;
use warnings;
use feature 'say';
my #names;
while (<DATA>) {
if ( m/^\t"name":"(.+)"/) {
undef #names;
$names[0] = $1;
next;
}
if (m/^\t\t"name":"(.+)"/) {
$names[1] = $1;
next;
}
if (m/^\t\t\t"name":"(.+)"/) {
$names[2] = $1;
next;
}
if (m/^\t\t\t\t"name":"(.+)"/) {
$names[3] = $1;
next;
}
if (m/^\t\t\t\t\t"name":"(.+)"/) {
$names[4] = $1;
say join "\t", #names;
}
}
__DATA__
{
"name":"ko00001",
"children":[
{
"name":"09100 Metabolism",
"children":[
{
"name":"09101 Carbohydrate metabolism",
"children":[
{
"name":"00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]",
"children":[
{
"name":"K00844 HK; hexokinase [EC:2.7.1.1]"
},
{
"name":"K12407 GCK; glucokinase [EC:2.7.1.2]"
},
use v5.14;
use warnings;
use open ":std", ":encoding(UTF-8)";
my #names;
while ( <> ) {
my ( $tabs, $name ) = /^\t{2}(\t*)"name": "(.*)"/
or next;
my $level = length( $tabs );
$names[ $level ] = $name;
say join "\t", #names if $level == 4;
}
It's horrible not to use a JSON parser.
Although the code doesn't look very clean, I manage to create the table in TSV format perfectly like the one produced by sed and awk.
Thanks all for the info on using the module JSON, but with this way, I learn a bit more about using the variable outside of the loop block, we can store it for the next turn in the loop.
use strict;
my $brite_hierarchy_filepath = shift #ARGV;
open my $brite_hierarchy, '<:utf8', $brite_hierarchy_filepath or die q{Can't open $brite_hierarchy_filepath: $!\n};
my $previous_1;
my $previous_2;
my $previous_3;
while (my $line = <$brite_hierarchy>) {
next if $. == 4;
chomp $line;
# change accordingly to the hierarchical levels
$line =~ s/\A\t{2}"name"/"level_1"/;
$line =~ s/\A\t{3}"name"/"level_2"/;
$line =~ s/\A\t{4}"name"/"level_3"/;
$line =~ s/\A\t{5}"name"/"level_4"/;
# find the categories and put them into a hash
my ($tag) = $line =~ /\A"(.*?)"/;
my ($value) = $line =~ /\A"level_[1-4]":"(.*?)"/;
my %field = ($tag => $value) unless $tag eq "" && $value eq "";
for (keys %field) {
$previous_1 = $field{"level_1"} if $_ eq "level_1" && $field{"level_1"} ne "";
$previous_2 = $field{"level_2"} if $_ eq "level_2" && $field{"level_2"} ne "";
$previous_3 = $field{"level_3"} if $_ eq "level_3" && $field{"level_3"} ne "";
print join("\t", $previous_1, $previous_2, $previous_3, $field{"level_4"}, "\n") unless $field{"level_4"} eq "";
};
last if eof $brite_hierarchy;
};
Related
I have JSON code that I'm pulling with key names that are the same and I'm trying to pull the values from the keys one at a time and pass them to variables (in a loop) in a perl script but it pulls all of the values at one time instead of iterating through them. I'd like to pull a value from a key and pass it to a variable then iterate through the loop again for the next value. The amount of data changes in JSON so the amount of identical keys will grow.
Perl Script Snippet
#!/usr/bin/perl
use warnings;
use strict;
use JSON::XS;
my $res = "test.json";
my $txt = do {
local $/;
open my $fh, "<", $res or die $!;
<$fh>;
};
my $json = decode_json($txt);
for my $mdata (#{ $json->{results} }) {
my $sitedomain = "$mdata->{custom_fields}->{Domain}";
my $routerip = "$mdata->{custom_fields}->{RouterIP}";
#vars
my $domain = $sitedomain;
my $host = $routerip;
print $domain;
print $host;
}
Print $host variable
print $host;
192.168.201.1192.168.202.1192.168.203.1
Print $domain variable
print $domain;
site1.global.localsite2.global.localsite3.global.local
JSON (test.json)
{
"results": [
{
"id": 37,
"url": "http://global.local/api/dcim/sites/37/",
"display": "Site 1",
"name": "Site 1",
"slug": "site1",
"custom_fields": {
"Domain": "site1.global.local",
"RouterIP": "192.168.201.1"
}
},
{
"id": 38,
"url": "http://global.local/api/dcim/sites/38/",
"display": "Site 2",
"name": "Site 2",
"slug": "site2",
"custom_fields": {
"Domain": "site2.global.local",
"RouterIP": "192.168.202.1"
}
},
{
"id": 39,
"url": "http://global.local/api/dcim/sites/39/",
"display": "Site 3",
"name": "Site 3",
"slug": "site3",
"custom_fields": {
"Domain": "site3.global.local",
"RouterIP": "192.168.203.1"
}
}
]
}
Your code produces expected result if you add \n to print statement. You can utilize say instead of print if there is no format required.
use warnings;
use strict;
use feature 'say';
use JSON::XS;
my $res = "test.json";
my $txt = do {
local $/;
open my $fh, "<", $res or die $!;
<$fh>;
};
my $json = decode_json($txt);
for my $mdata (#{ $json->{results} }) {
my $sitedomain = "$mdata->{custom_fields}->{Domain}";
my $routerip = "$mdata->{custom_fields}->{RouterIP}";
#vars
my $domain = $sitedomain;
my $host = $routerip;
say "$domain $host";
}
The code can be re-written in shorter form as following
use strict;
use warnings;
use feature 'say';
use JSON;
my $fname = 'router_test.json';
my $txt = do {
local $/;
open my $fh, "<", $fname or die $!;
<$fh>;
};
my $json = from_json($txt);
say "$_->{custom_fields}{Domain} $_->{custom_fields}{RouterIP}" for #{$json->{results}};
It sounds like you want to "slice" the data. You could buffer in code, or collect unique values later. Let's modify what you started with, and make some tweaks:
n.b. No need to quote my $sitedomain = "$mdata->{custom_fields}->{Domain}";. The content of the JSON is already a string, and forcing Perl to make another string by interpolating it is unnecessary.
n.b.2 JSON::XS works automatically if it's installed.
my %domains;
my %ips;
for my $mdata (#{ $json->{results} }) {
my $sitedomain = $mdata->{custom_fields}->{Domain};
my $routerip = $mdata->{custom_fields}->{RouterIP};
# Collect and count all the unique domains and IPs by storing them as hash keys
$domains{$sitedomain} += 1;
$ips{$routerip} += 1;
}
for my $key (keys %domains) {
printf "%s %s\n", $key, $domains{$key};
# and so on
}
If we don't know the custom fields, we can play with nested hashes to collect it all:
my %fields;
for my $mdata (#{ $json->{results} }) {
for my $custom_field (keys %{ $mdata->{custom_fields} }) {
$fields{$custom_field}{$mdata->{custom_fields}{$custom_field}} += 1;
}
}
for my $custom_field (keys %fields) {
print "$custom_field:\n";
for my $unique_value (keys %{ $fields{$custom_field} }){
printf "%s - %s\n", $unique_value, $fields{$custom_field}{$unique_value};
}
}
Example output:
RouterIP:
192.168.201.1 - 1
192.168.203.1 - 1
192.168.202.1 - 1
Domain:
site2.global.local - 1
site1.global.local - 1
site3.global.local - 1
... or something like that. Nested structures lead very quickly to messy code. You can mitigate it by dereferencing the substructures. It could also be more predictable if we work with a known list of keys e.g.
my #known_keys = qw/RouterIP Domain/;
for my $mdata (#{ $json->{results} }) {
for my $custom_field (#known_keys) {
if (exists $fields{$custom_field}) {
$fields{$custom_field}{$mdata->{custom_fields}{$custom_field}} += 1;
}
}
}
If the JSON file is massive you may run out of memory. For this you would need to look into a package like JSON::SL or JSON::Streaming::Reader. They're more involved to use but prevent you from needing to load the whole file into memory. There are also unix tools like jq that provide the same powers.
My custom code (on Perl) give next wrong JSON, missing comma between blocks:
{
"data": [{
"{#LOGFILEPATH}": "/tmp/QRZ2007.tcserverlogs",
"{#LOGFILE}": "QRZ2007"
} **missing comma** {
"{#LOGFILE}": "ARZ2007",
"{#LOGFILEPATH}": "/tmp/ARZ2007.tcserverlogs"
}]
}
My terrible code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use utf8;
use JSON;
binmode STDOUT, ":utf8";
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
opendir(DIR, $dir) or die $!;
print '{"data": [';
while (my $file = readdir(DIR)) {
next unless (-f "$dir/$file");
next unless ($file =~ m/\.tcserverlogs$/);
my $fullPath = "$dir/$file";
my $filenameshort = basename($file, ".tcserverlogs");
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
print $json->encode($data_to_json);
}
print ']}'."\n";
closedir(DIR);
exit 0;
Dear Team i am not a programmer, please any idea how fix it, thank you!
If you do not print a comma, you will not get a comma.
You are trying to build your own JSON string from pre-encoded pieces of smaller data structures. That will not work unless you tell Perl when to put commas. You could do that, but it's easier to just collect all the data into a Perl data structure that is equivalent to the JSON string you want to produce, and encode the whole thing in one go when you're done.
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
my #data;
opendir( DIR, $dir ) or die $!;
while ( my $file = readdir(DIR) ) {
next unless ( -f "$dir/$file" );
next unless ( $file =~ m/\.tcserverlogs$/ );
my $fullPath = "$dir/$file";
my $filenameshort = basename( $file, ".tcserverlogs" );
my $data_to_json = { "{#LOGFILEPATH}" => $fullPath, "{#LOGFILE}" => $filenameshort };
push #data, $data_to_json;
}
closedir(DIR);
print $json->encode( { data => \#data } );
I will have a possibly very large JSON file and I want to stream from it instead of load it all into memory. Based on the following statement (I added the emphasis) from JSON::XS, I believe it won't suit my needs. Is there a Perl 5 JSON module that will stream the results from the disk?
In some cases, there is the need for incremental parsing of JSON texts. While this module always has to keep both JSON text and resulting Perl data structure in memory at one time, it does allow you to parse a JSON stream incrementally. It does so by accumulating text until it has a full JSON object, which it then can decode. This process is similar to using decode_prefix to see if a full JSON object is available, but is much more efficient (and can be implemented with a minimum of method calls).
To clarify, the JSON will contain an array of objects. I want to read one object at a time from the file.
In terms of ease of use and speed, JSON::SL seems to be the winner:
#!/usr/bin/perl
use strict;
use warnings;
use JSON::SL;
my $p = JSON::SL->new;
#look for everthing past the first level (i.e. everything in the array)
$p->set_jsonpointer(["/^"]);
local $/ = \5; #read only 5 bytes at a time
while (my $buf = <DATA>) {
$p->feed($buf); #parse what you can
#fetch anything that completed the parse and matches the JSON Pointer
while (my $obj = $p->fetch) {
print "$obj->{Value}{n}: $obj->{Value}{s}\n";
}
}
__DATA__
[
{ "n": 0, "s": "zero" },
{ "n": 1, "s": "one" },
{ "n": 2, "s": "two" }
]
JSON::Streaming::Reader was okay, but it is slower and suffers from too verbose an interface (all of these coderefs are required even though many do nothing):
#!/usr/bin/perl
use strict;
use warnings;
use JSON::Streaming::Reader;
my $p = JSON::Streaming::Reader->for_stream(\*DATA);
my $obj;
my $attr;
$p->process_tokens(
start_array => sub {}, #who cares?
end_array => sub {}, #who cares?
end_property => sub {}, #who cares?
start_object => sub { $obj = {}; }, #clear the current object
start_property => sub { $attr = shift; }, #get the name of the attribute
#add the value of the attribute to the object
add_string => sub { $obj->{$attr} = shift; },
add_number => sub { $obj->{$attr} = shift; },
#object has finished parsing, it can be used now
end_object => sub { print "$obj->{n}: $obj->{s}\n"; },
);
__DATA__
[
{ "n": 0, "s": "zero" },
{ "n": 1, "s": "one" },
{ "n": 2, "s": "two" }
]
To parse 1,000 records it took JSON::SL .2 seconds and JSON::Streaming::Reader 3.6 seconds (note, JSON::SL was being fed 4k at a time, I had no control over JSON::Streaming::Reader's buffer size).
Have you looked at JSON::Streaming::Reader which shows up as first while searching for 'JSON Stream' on search.cpan.org?
Alternatively JSON::SL found by searching for 'JSON SAX' - not quite as obvious search terms, but what you describe sounds like a SAX parsers for XML.
It does so by accumulating text until it has a full JSON object, which it then can decode.
This is what screws your over. A JSON document is one object.
You need to define more clearly what you want from incremental parsing. Are you looking for one element of a large mapping? What are you trying to do with the information you read out/write?
I don't know any library that will incrementally parse JSON data by reading one element out of an array at once. However this is quite simple to implement yourself using a finite state automaton (basically your file has the format \s*\[\s*([^,]+,)*([^,]+)?\s*\]\s* except that you need to parse commas in strings correctly.)
Did you try to skip first right braket [ and then the commas , :
$json->incr_text =~ s/^ \s* \[ //x;
...
$json->incr_text =~ s/^ \s* , //x;
...
$json->incr_text =~ s/^ \s* \] //x;
like in the third example :
http://search.cpan.org/dist/JSON-XS/XS.pm#EXAMPLES
If you have control over how you're generating your JSON, then I suggest turning pretty formatting off and printing one object per line. This makes parsing simple, like so:
use Data::Dumper;
use JSON::Parse 'json_to_perl';
use JSON;
use JSON::SL;
my $json_sl = JSON::SL->new();
use JSON::XS;
my $json_xs = JSON::XS->new();
$json_xs = $json_xs->pretty(0);
#$json_xs = $json_xs->utf8(1);
#$json_xs = $json_xs->ascii(0);
#$json_xs = $json_xs->allow_unknown(1);
my ($file) = #ARGV;
unless( defined $file && -f $file )
{
print STDERR "usage: $0 FILE\n";
exit 1;
}
my #cmd = ( qw( CMD ARGS ), $file );
open my $JSON, '-|', #cmd or die "Failed to exec #cmd: $!";
# local $/ = \4096; #read 4k at a time
while( my $line = <$JSON> )
{
if( my $obj = json($line) )
{
print Dumper($obj);
}
else
{
die "error: failed to parse line - $line";
}
exit if( $. == 5 );
}
exit 0;
sub json
{
my ($data) = #_;
return decode_json($data);
}
sub json_parse
{
my ($data) = #_;
return json_to_perl($data);
}
sub json_xs
{
my ($data) = #_;
return $json_xs->decode($data);
}
sub json_xs_incremental
{
my ($data) = #_;
my $result = [];
$json_xs->incr_parse($data); # void context, so no parsing
push( #$result, $_ ) for( $json_xs->incr_parse );
return $result;
}
sub json_sl_incremental
{
my ($data) = #_;
my $result = [];
$json_sl->feed($data);
push( #$result, $_ ) for( $json_sl->fetch );
# ? error: JSON::SL - Got error CANT_INSERT at position 552 at json_to_perl.pl line 82, <$JSON> line 2.
return $result;
}
I'm parsing a CSV file in which each line look something as below.
10998,4499,SLC27A5,Q9Y2P5,GO:0000166,GO:0032403,GO:0005524,GO:0016874,GO:0047747,GO:0004467,GO:0015245,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
There seems to be trailing commas at the end of each line.
I want to get the first term, in this case "10998" and get the number of GO terms related to it.
So my output in this case should be,
Output:
10998,7
But instead it shows 299. I realized overall there are 303 commas in each line. And I'm not able to figure out an easy way to remove trailing commas. Can anyone help me solve this issue?
Thanks!
My Code:
use strict;
use warnings;
open my $IN, '<', 'test.csv' or die "can't find file: $!";
open(CSV, ">GO_MF_counts_Genes.csv") or die "Error!! Cannot create the file: $!\n";
my #genes = ();
my $mf;
foreach my $line (<$IN>) {
chomp $line;
my #array = split(/,/, $line);
my #GO = splice(#array, 4);
my $GO = join(',', #GO);
$mf = count($GO);
print CSV "$array[0],$mf\n";
}
sub count {
my $go = shift #_;
my $count = my #go = split(/,/, $go);
return $count;
}
I'd use juanrpozo's solution for counting but if you still want to go your way, then remove the commas with regex substitution.
$line =~ s/,+$//;
I suggest this more concise way of coding your program.
Note that the line my #data = split /,/, $line discards trailing empty fields (#data has only 11 fields with your sample data) so will produce the same result whether or not trailing commas are removed beforehand.
use strict;
use warnings;
open my $in, '<', 'test.csv' or die "Cannot open file for input: $!";
open my $out, '>', 'GO_MF_counts_Genes.csv' or die "Cannot open file for output: $!";
foreach my $line (<$in>) {
chomp $line;
my #data = split /,/, $line;
printf $out "%s,%d\n", $data[0], scalar grep /^GO:/, #data;
}
You can apply grep to #array
my $mf = grep { /^GO:/ } #array;
assuming $array[0] never matches /^GO:/
For each your line:
foreach my $line (<$IN>) {
my ($first_term) = ($line =~ /(\d+),/);
my #tmp = split('GO', " $line ");
my $nr_of_GOs = #tmp - 1;
print CSV "$first_term,$nr_of_GOs\n";
}
I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.
I've tried many regular expressions such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like so:
#!perl -w
open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;
my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";
if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
if ($lines =~ / <td>(.*)< / ) {
print "this is the 2nd 1: \n";
print $1;
print "the word was: $1.\n";
$Time = $1;
print $Time;
print OUTPUT_FILE $Time;
} else {
print "2ND IF FAILED\n";
}
} else {
print "THIS FAILED\n";
}
close(INPUT_FILE);
close(OUTPUT_FILE);
Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
HTML::Parser
HTML::TreeBuilder
HTML::TableExtract
Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $file = 'Table3.htm';
unless ( -e $file ) {
my $rc = getstore(
'http://www.ntsb.gov/aviation/Table3.htm',
$file);
die "Failed to download document\n" unless $rc == 200;
}
my #headers = qw( Year Fatalities );
my $te = HTML::TableExtract->new(
headers => \#headers,
attribs => { id => 'myTable' },
);
$te->parse_file($file);
my ($table) = $te->tables;
print join("\t", #headers), "\n";
for my $row ($te->rows ) {
print join("\t", #$row), "\n";
}
This is what I meant in another post by "task-specific" HTML parsers.
You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
That's an easy one:
my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my #stuff = $html =~ />([^<]+)</g;
print join (", ", #stuff), "\n";
See http://codepad.org/qz9d5Bro if you want to try running it.