I've found some data that someone is downloading into a JSON file (I think! - I'm a newb!). The file contains data on nearly 600 football players.
Here's the file: https://raw.githubusercontent.com/llimllib/fantasypl_stats/f944410c21f90e7c5897cd60ecca5dc72b5ab619/data/players.1426687570.json
Is there a way I can grab some of the data and convert it to .csv? Specifically the 'Fixture History'?
Thanks in advance for any help :)
Here is a solution using jq
If the file filter.jq contains
.[]
| {first_name, second_name, all:.fixture_history.all[]}
| [.first_name, .second_name, .all[]]
| @csv
and data.json contains the sample data then the command
jq -M -r -f filter.jq data.json
will produce the output (note: only the first few rows are shown here)
"Wojciech","Szczesny","16 Aug 17:30",1,"CRY(H) 2-1",90,0,0,0,1,0,0,0,0,0,1,0,13,7,0,55,2
"Wojciech","Szczesny","23 Aug 17:30",2,"EVE(A) 2-2",90,0,0,0,2,0,0,0,0,0,0,0,5,9,-9306,55,1
"Wojciech","Szczesny","31 Aug 16:00",3,"LEI(A) 1-1",90,0,0,0,1,0,0,0,1,0,2,0,7,15,-20971,55,1
"Wojciech","Szczesny","13 Sep 12:45",4,"MCI(H) 2-2",90,0,0,0,2,0,0,0,0,0,6,0,12,17,-39686,55,3
"Wojciech","Szczesny","20 Sep 15:00",5,"AVL(A) 3-0",90,0,0,1,0,0,0,0,0,0,2,0,14,22,-15931,55,6
"Wojciech","Szczesny","27 Sep 17:30",6,"TOT(H) 1-1",90,0,0,0,1,0,0,0,0,0,4,0,10,13,-5389,55,3
"Wojciech","Szczesny","05 Oct 14:05",7,"CHE(A) 0-2",90,0,0,0,2,0,0,0,0,0,1,0,3,9,-8654,55,1
"Wojciech","Szczesny","18 Oct 15:00",8,"HUL(H) 2-2",90,0,0,0,2,0,0,0,0,0,2,0,7,9,-824,54,1
"Wojciech","Szczesny","25 Oct 15:00",9,"SUN(A) 2-0",90,0,0,1,0,0,0,0,0,0,3,0,16,22,-11582,54,7
JSON is a richer data format than CSV - it allows for more complex data structures - so converting to CSV inevitably 'loses detail'.
If you want to fetch it automatically - that's doable, but I've skipped it because 'doing' https URLs is slightly more complicated.
So assuming you've downloaded your file, here's a possible solution in Perl. (You've already got one for Python - both are very powerful scripting languages that cover pretty much the same ground, so which you use is largely a matter of taste.)
#!/usr/bin/perl
use strict;
use warnings;

use JSON;

my $file = 'players.json';
open( my $input, "<", $file ) or die $!;

my $json_data = decode_json(
    do { local $/; <$input> }
);

foreach my $player_id ( keys %{$json_data} ) {
    foreach my $fixture (
        @{ $json_data->{$player_id}->{fixture_history}->{all} } )
    {
        print join( ",",
            $player_id, $json_data->{$player_id}->{web_name},
            @{$fixture}, "\n", );
    }
}
Hopefully you can see what's going on here - you read the file handle $input and pass the contents through decode_json to create a data structure.
This data structure is a nested hash (a hash is Perl's term for a key-value structure).
So we extract the keys from this hash - which is the ID number right at the beginning of each entry.
Then we loop through each of them, extracting the fixture_history array. For each element in that array, we print the player ID, their web_name and then the data from fixture_history.
This gives output like:
1,Szczesny,10 Feb 19:45,25,LEI(H) 2-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-2413,52,0,
1,Szczesny,21 Feb 15:00,26,CRY(A) 2-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-2805,52,0,
1,Szczesny,01 Mar 14:05,27,EVE(H) 2-0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1862,52,0,
1,Szczesny,04 Mar 19:45,28,QPR(A) 2-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1248,52,0,
1,Szczesny,14 Mar 15:00,29,WHU(H) 3-0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1897,52,0,
Does this make sense?
Python has some good libraries for doing this. If you copy the following code into a file and save it as fix_hist.py or something, then save your JSON file as file.json in the same directory, it will create a csv file with the fixture histories each saved as a row. Just run python fix_hist.py in your command prompt (or terminal for mac):
import csv
import json

json_data = open("file.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv", "wb+"))

for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        f.writerow(j)

json_data.close()
To add additional data to the fixture history, you can add insert statements before writing the rows:
import csv
import json

json_data = open("file.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv", "wb+"))

for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        try:
            j.insert(0, str(data[i]["first_name"]))
        except:
            j.insert(0, 'error')
        try:
            j.insert(1, data[i]["web_name"])
        except:
            j.insert(1, 'error')
        try:
            f.writerow(j)
        except:
            f.writerow(['error', 'error'])

json_data.close()
With insert(), just indicate the position in the row you want the data point to occupy as the first argument.
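For example, with a row shaped like the fixture rows shown earlier (the values here are just taken from the sample output above, for illustration):

row = ["16 Aug 17:30", 1, "CRY(H) 2-1", 90]
row.insert(0, "Wojciech")   # row is now ["Wojciech", "16 Aug 17:30", 1, "CRY(H) 2-1", 90]
row.insert(1, "Szczesny")   # row is now ["Wojciech", "Szczesny", "16 Aug 17:30", 1, "CRY(H) 2-1", 90]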
I have tons of CSV files with 100+ columns to read into Spark (Databricks). I do not want to specify the schema manually, and have thought of the following approach: read in a "reference" CSV file, get the schema from this file, and apply it as the "reference_schema" to all other files I need to read in. The code would look as follows (but I cannot get it to work).
# File location and type
file_location = "/FileStore/tables/reference_file_with_ok_schema.csv"
file_type = "csv"

# CSV options
infer_schema = "True"
first_row_is_header = "True"
delimiter = ";"

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

mySchema = df.schema  ### this is probably where I go wrong

display(df)
Next I would apply mySchema as the reference Schema for new csv's like in the following example:
# File location and type
file_location = "/FileStore/tables/all_other_files.csv"
file_type = "csv"

# CSV options
first_row_is_header = "True"
delimiter = ";"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .schema(mySchema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)
This only produces nulls
Thanks in advance for your help and
best regards
Alex
You have the right approach.
You can check the two options mode and columnNameOfCorruptRecord. By default, mode=PERMISSIVE, which creates NULL records when a line does not match the schema.
That is probably why you have NULL records in your dataframe: it means that mySchema and the schema of the file all_other_files are different.
The first thing to check is to infer the schema of all_other_files and compare it with mySchema. To do that easily, schema objects have a json method which outputs them as a JSON string. It is easier for a human to compare two JSON strings than two schema objects.
mySchema.json()
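For example, a quick comparison could look like the sketch below (reusing the file path and CSV options from your question; the name df_check is just for illustration):

df_check = spark.read.format("csv") \
    .option("inferSchema", "True") \
    .option("header", "True") \
    .option("sep", ";") \
    .load("/FileStore/tables/all_other_files.csv")

# compare the two schemas as JSON strings
print(mySchema.json())
print(df_check.schema.json())
print(mySchema.json() == df_check.schema.json())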
If there is just one difference, the whole line will be set to NULL unfortunately.
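If you also want to see which raw lines fail against mySchema, one option (a sketch using the options mentioned above, not tested against your data) is to add a string column for columnNameOfCorruptRecord to the schema and read in PERMISSIVE mode; rows that cannot be parsed keep their raw text in that column:

from pyspark.sql.types import StructType, StructField, StringType

# assumption: mySchema is the reference schema built earlier in the question
schema_with_corrupt = StructType(mySchema.fields + [StructField("_corrupt_record", StringType(), True)])

df = spark.read.format("csv") \
    .schema(schema_with_corrupt) \
    .option("header", "True") \
    .option("sep", ";") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .load("/FileStore/tables/all_other_files.csv")

display(df)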
I have a lot of JSON files whose structure looks like the one below:
{
key1: 'val1'
key2: {
'key21': 'someval1',
'key22': 'someval2',
'key23': 'someval3',
'date': '2018-07-31T01:30:30Z',
'key25': 'someval4'
}
key3: []
... some other objects
}
My goal is to get only those files where the date field falls within some period.
For example from 2018-05-20 to 2018-07-20.
I can't rely on the creation date of these files, because they were all generated on the same day.
Maybe it is possible using sed or a similar program?
Fortunately, the date in this format can be compared as a string. You only need something to parse the JSONs, e.g. Perl:
perl -l -0777 -MJSON::PP -ne '
    $date = decode_json($_)->{key2}{date};
    print $ARGV if $date gt "2018-07-01T00:00:00Z";
' *.json
-0777 makes perl slurp the whole files instead of reading them line by line
-l adds a newline to print
$ARGV contains the name of the currently processed file
See JSON::PP for details. If you have JSON::XS or Cpanel::JSON::XS, you can switch to them for faster processing.
I had to fix the input (replace ' by ", add commas, etc.) in order to make the parser happy.
If your files actually contain valid JSON, the task can be accomplished in a one-liner with jq, e.g.:
jq 'if .key2.date[0:10] | (. >= "2018-05-20" and . <= "2018-07-31") then input_filename else empty end' *.json
This is just an illustration. jq has date-handling functions for dealing with more complex requirements.
Handling quasi-JSON
If your files contain quasi-JSON, then you could use jq in conjunction with a JSON rectifier. If your sample is representative, then hjson could be used, e.g.
for f in *.qjson
do
    hjson -j $f | jq --arg f "$f" '
        if .key2.date[0:7] == "2018-07" then $f else empty end'
done
Try it like this:
1. Find an online converter (for example: https://codebeautify.org/json-to-excel-converter#) and convert the JSON to CSV
2. Open the CSV file with Excel
3. Filter your data
I have two Json files which come from different OSes.
Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.
One file comes from OS X and the filename is in NFD form: (od -bc)
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
the second contains the same filename but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called 'different normalization', and there is a CPAN module, Unicode::Normalize, for handling it.
I'm reading both files with the following:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
read_file is from File::Slurp and decode_json from JSON::XS.
Reading the JSON into a perl structure, the filename ends up in the key position in one hash and among the values in the other. I need to find when a hash key from the first hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.
I tried the following:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
produces for me the same output.
Now the questions:
How do I simply read both JSON files to get the same normalization into both $hashrefs?
Or do I need to run something like the following on both hashes after decode_json?
while(my($k,$v) = each(%$json1)) {
    $copy->{ NFD($k) } = NFD($v);
}
In short:
How do I read different JSON files so they get the same normalization 'inside' the perl $href? Is it possible to achieve this somewhat more nicely than by explicitly doing NFD on each key and value and creating another NFD-normalized (big) copy of the hashes?
Some hints, suggestions - please...
Because my English is very bad, here is a simulation of the problem:
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;

# Creating two files that contain different "normalizations"
my($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';

# save as NFC - this comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;

# save as NFD - this comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;

# now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file";
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file";

say $jd->{ $jc->{key} } // "NOT FOUND"; # wanted to print "something"

my $jc2;
# is there a better way to DO THIS?
while(my($k,$v) = each(%$jc)) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NOT FOUND"; # OK
While searching for the right solution to your question I discovered: the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407.
Anyway, I found a solution for your particular question - how to read JSON containing filenames regardless of normalization:
instead of your:
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
use the following:
#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...
sub get_json_from_utf8_file {
    my $file = shift;
    return
        decode_json        # let JSON parse the json into a perl structure
        encode 'utf8',     # decode_json wants a utf8-encoded binary string, so encode it
        NFC                # convert to the precomposed normalization - regardless of the source
        read_file          # your file contains utf8-encoded text, so read it correctly
            $file, { binmode => ':utf8' };
}
This should (at least I hope) ensure that, regardless of what decomposition the JSON content uses, NFC will convert it to the precomposed version and JSON::XS will parse it into the same internal perl structure.
So your example prints:
something
without traversing the $json
The idea comes from Joseph Myers and Nemo ;)
Maybe some more skilled programmers will give more hints.
Even though it may be important right now only to convert a few file names to the same normalization for comparison, other unexpected problems could arise from almost anywhere if JSON data has a different normalization.
So my suggestion is to normalize the entire input from both sources as your first step before doing any parsing (i.e., at the same time you read the file and before decode_json). This should not corrupt any of your JSON structures since those are delimited using ASCII characters. Then your existing perl code should be able to blindly assume all UTF8 characters have the same normalization.
$rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
$rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";
my $json1 = decode_json NFD($rawdata1);
my $json2 = decode_json NFD($rawdata2);
To make this process slightly faster (it should be plenty fast already, since the module uses fast XS procedures), you can find out whether one of the two data files is already in a certain normalization form, and then leave that file unchanged, and convert the other file into that form.
For example:
$rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
$rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";
if (checkNFD($rawdata1)) {
    # then you know $file1 is already in Normalization Form D
    # (i.e., it was formed by canonical decomposition),
    # so you only need to convert $file2 into NFD
    $rawdata2 = NFD($rawdata2);
}
my $json1 = decode_json $rawdata1;
my $json2 = decode_json $rawdata2;
Of course, you would have to experiment at development time to see whether one or the other of the input files is already in a normalized form; in the final version of the code you would no longer need a conditional statement, but would simply convert the other input file into the same normalized form.
Also note that it is suggested to produce output in NFC form (if your program produces any output that would be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html
Hm. I can't advise a better "programming" solution, but why not simply run
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < freebsd.json >bsdok.json
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json >osxok.json
and now your script can read and use both, because they are in the same normalisation? So instead of searching for a programming solution inside your script, solve the problem before the data enters the script. (The second command may even be unnecessary, since the OS X file is already in NFD.) It is a simple conversion at the file level - surely easier than traversing data structures...
Instead of traversing the data structure manually, let a module handle this for you.
Data::Visitor
Data::Rmap
Data::Dmap
I am able to convert a hard-coded JSON string into Perl hashes; however, when I want to convert a complete JSON file into Perl data structures which can be parsed later in any manner, I get the following error.
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at json_vellai.pl line 9
use JSON::PP;

$json = JSON::PP->new();
$json = $json->allow_singlequote([$enable]);
open (FH, "jsonsample.doc") or die "could not open the file\n";
#$fileContents = do { local $/;<FH>};
@fileContents = <FH>;
#print @fileContents;
$str = $json->allow_barekey->decode(@filecontents);
foreach $t (keys %$str)
{
    print "\n $t -- $str->{$t}";
}
This is how my code looks. Please help me out.
It looks to me like decode doesn't want a list, it wants a scalar string.
You could slurp the file:
undef $/;
$fileContents = <FH>;
I have to extract data from a JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed-width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pairs, I can extract them by processing each line of the JSON file, checking the type and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting?
Sample JSON file would look like as below:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
    line = ""
    gsub("[{}\x22]", "", $0)
    f = split($0, a, "[:,]")
    for (i = 1; i <= f; i++)
        if (a[i] == "Type")
            file = a[++i]
        else
            line = line sprintf("%-15s", a[i])
    print line > (file ".fixed.out")
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs';  # for FileCache
use FileCache;     # avoid exceeding system's maximum number of file descriptors
use JSON;

my $type;
my $json = JSON->new->utf8(1);  # NOTE: expect utf-8 strings

while (my $line = <>) {  # for each input line
    # extract type
    eval { $type = $json->decode($line)->{Type} };
    $type = 'json_decode_error' if $@;
    $type ||= 'missing_type';

    # print to the appropriate file
    my $fh = cacheout '>>', "$type.out";
    print $fh $line;  # NOTE: use cache if there are too many hdd seeks
}
corresponding shell script
#!/bin/bash
#NOTE: bash is used to create non-ascii filenames correctly

__extract_type()
{
    perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}

__process_input()
{
    local IFS=$'\n'
    while read line; do  # for each input line
        # extract type
        local type="$(__extract_type "$line" 2>/dev/null ||
                      echo json_decode_error)"
        [ -z "$type" ] && local type=missing_type
        # print to the appropriate file
        echo "$line" >> "$type.out"
    done
}

__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out