Paths from JSON file don't expand in Snakemake

I have a Snakemake pipeline where I get the input/output paths for my file folders from a JSON file and use the expand function to build the paths:
import json

with open('config.json', 'r') as f:
    config = json.load(f)

wildcard = ["1234", "5678"]

rule them_all:
    input:
        expand('config["data_input"]/data_{wc}.tab', wc=wildcard)
    output:
        expand('config["data_output"]/output_{wc}.rda', wc=wildcard)
    shell:
        "Rscript ./my_script.R"
My config.json is:
{
    "data_input": "/very/long/path",
    "data_output": "/slightly/different/long/path"
}
While trying to make a dry run, though, I get the following error:
$ snakemake -np
Building DAG of jobs...
MissingInputException in line 12 of /path/to/Snakefile:
Missing input files for rule them_all:
config["data_input"]/data_1234.tab
config["data_input"]/data_5678.tab
The files are there and their path is /very/long/path/data_1234.tab.
This is probably low-hanging fruit, but what am I doing wrong in the syntax for the expansion? Or is it the way I read the JSON file?

expand() does not evaluate Python expressions such as dictionary accesses inside its first argument: everything between the quotation marks is taken as a literal pattern, and only the {braced} placeholders are substituted. The dictionary lookup therefore has to be supplied as a placeholder value.
The correct syntax in this case would be e.g.
expand('{input_folder}/data_{wc}.tab', wc=wildcard, input_folder=config["data_input"])
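Putting it together, the whole rule would read something like this (a sketch, assuming the same config.json and wildcard list as above):

import json

with open('config.json', 'r') as f:
    config = json.load(f)

wildcard = ["1234", "5678"]

rule them_all:
    input:
        # the folder comes in through a placeholder; expand() only
        # substitutes {braced} names and never evaluates Python code
        # embedded in the pattern string
        expand('{input_folder}/data_{wc}.tab',
               input_folder=config["data_input"], wc=wildcard)
    output:
        expand('{output_folder}/output_{wc}.rda',
               output_folder=config["data_output"], wc=wildcard)
    shell:
        "Rscript ./my_script.R"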


How to infer a schema from a "reference file" and apply it as a reference to files to be read in?

I have tons of CSV files to read into Spark (Databricks), each with 100+ columns. I do not want to specify the schema manually, so I have thought of the following approach: read in a "reference" CSV file, get the schema from that file, and apply it as a "reference_schema" to all the other files I need to read in. The code would look as follows (but I cannot get it to work):
# File location and type
file_location = "/FileStore/tables/reference_file_with_ok_schema.csv"
file_type = "csv"

# CSV options
infer_schema = "True"
first_row_is_header = "True"
delimiter = ";"

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

mySchema = df.schema  # this is probably where I go wrong

display(df)
Next I would apply mySchema as the reference schema for new CSVs, as in the following example:
# File location and type
file_location = "/FileStore/tables/all_other_files.csv"
file_type = "csv"

# CSV options
first_row_is_header = "True"
delimiter = ";"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .schema(mySchema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)
This only produces nulls.
Thanks in advance for your help and best regards,
Alex
You have the right approach.
Check the two options mode and columnNameOfCorruptRecord. By default mode=PERMISSIVE, which creates NULL records when a line does not match the schema.
That is probably why you have NULL records in your dataframe: it means the schema mySchema and the schema of the file all_other_files are different.
The first thing to check is to infer the schema of all_other_files and compare it with mySchema. To do that easily, schema objects have a json() method which outputs them as a JSON string; it is easier for a human to compare two JSON strings than two schema objects.
mySchema.json()
If there is even a single difference, the whole line is unfortunately set to NULL.
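As a concrete illustration, here is a minimal sketch of both checks (the file path and the _corrupt_record column name are assumptions; note that the corrupt-record column must itself be added to the schema as a string field for Spark to populate it):

from pyspark.sql.types import StructField, StringType

# 1) re-infer the schema of the problem file and compare the two as JSON
df_check = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ";") \
    .load("/FileStore/tables/all_other_files.csv")
print(mySchema.json())
print(df_check.schema.json())

# 2) keep the offending lines visible instead of silently nulling them
schema_plus = mySchema.add(StructField("_corrupt_record", StringType(), True))
df = spark.read.format("csv") \
    .schema(schema_plus) \
    .option("header", "true") \
    .option("sep", ";") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .load("/FileStore/tables/all_other_files.csv")

# rows where _corrupt_record is not null failed to match mySchema
display(df.filter(df["_corrupt_record"].isNotNull()))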

Haskell Yesod and writeFile

I am trying to learn Yesod and to implement a simple REST app where, every time I get a GET request, I write something to a file. Right now I have the following handler function:
getTestR = do
    return $ writeFile "test.txt" "Just something"
    return $ object ["result" .= "Ok"]
What I was expecting is that the file test.txt would be created and I would obtain a JSON with {result=Ok}. However, I am obtaining the JSON, but the file is not being created.
I guess the writeFile is not being evaluated because of the lazy evaluation, but I have no idea how to overcome this problem. Thanks in advance.
Just use liftIO. The issue is not laziness: return merely wraps the writeFile action up as a value without ever running it, while liftIO lifts the IO action into the handler monad so it actually executes:
getTestR = do
    liftIO $ writeFile "test.txt" "Just something"
    return $ object ["result" .= "Ok"]

Parsing JSON from shell script using JSON.sh

I'm working on parsing JSON data using JSON.sh. I want to read data from a JSON file (test.json) whose content looks something like this:
{
    "/home/ukrishnan/projects/test.yml": {
        "LOG_DRIVER": "syslog",
        "IMAGE": "mysql:5.6"
    },
    "/home/ukrishnan/projects/mysql/app.xml": {
        "ENV_ACCOUNT_BRIDGE_ENDPOINT": "/u01/src/test/sample.txt"
    }
}
I try to parse this JSON with JSON.sh using:
test_parser=`sh ./lib/JSON.sh < test/test.json`
echo $test_parser
It prints:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog" ["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6" ["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"} ["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt" ["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"} [] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
Whereas the same command (sh ./lib/JSON.sh < test/test.json), run directly in the terminal, prints with line breaks:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}
[] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
I want to read this and assign the pieces to bash variables like:
file_name='/home/ukrishnan/projects/test.yml'
key='LOG_DRIVER'
value='syslog'
As I'm almost completely new to shell scripting, grep, and awk, I don't have much idea of how to achieve this. Any help on this would be greatly appreciated.
I wrote a JSON serializer / deserializer for gawk, if you're interested. Save that script as json.awk and modify it, replacing everything above # === FUNCTIONS === with the following:
#!/usr/bin/gawk -f

# capture the JSON string from beginning to end into a scalar variable
{ json = json ORS $0 }

END {
    # objectify the JSON string into the multilevel array "obj"
    deserialize(json, obj)

    for (filename in obj) {
        print "file_name=" quote(filename)
        for (key in obj[filename]) {
            # print key="value"
            print key "=" quote(obj[filename][key])
        }
    }
}
Do chmod 755 json.awk and execute it. Output will resemble this:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
ENV_ACCOUNT_BRIDGE_ENDPOINT="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
LOG_DRIVER="syslog"
IMAGE="mysql:5.6"
Hopefully the logic is reasonably easy to follow. If you prefer to output filename=, key=, and value= on every loop iteration, modify the nested for loops accordingly:
for (filename in obj) {
    for (key in obj[filename]) {
        print "file_name=" quote(filename)
        print "key=" quote(key)
        print "value=" quote(obj[filename][key])
    }
}
That change will result in the following output:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
key="ENV_ACCOUNT_BRIDGE_ENDPOINT"
value="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
key="LOG_DRIVER"
value="syslog"
file_name="/home/ukrishnan/projects/test.yml"
key="IMAGE"
value="mysql:5.6"
Anyway, with that output, you can do something silly in BASH like this to populate and act upon the variables:

#!/bin/bash

./json.awk test5.json | while read -r line; do {
    eval $line
    [ "${line/=*/}" = "value" ] && {
        echo "bash: file_name=$file_name"
        echo "bash: key=$key"
        echo "bash: value=$value"
        echo "------"
    }
}; done
It'd probably be more graceful just to do all processing within gawk from start to finish and not mess with the polyglot handoff, though.
Getting back to json.awk: if you prefer to keep json.awk modular for easy reuse in future projects, you could remove everything above # === FUNCTIONS ===, create a separate main.awk containing the code block at the top of this answer, and #include "json.awk" as a helper library pretty much anywhere outside of END {...} (just below the shebang, for example).
JSON.sh (from http://json.org) offers a nice bash-friendly means of flattening out a JSON file, and you've already shown in your question what that looks like. The flattened form follows the format:
[node] tab value
Thinking in UNIX-script terms about extracting the information you want, you'll note that the lines you're interested in actually follow this pattern:
["filename","key"] tab ["value"]
In regex notation, we replace:
filename with (.*)
key with (.*)
tab with \t
value with (.*)
We can retrieve the first, second and third matching groups with \1, \2, \3 respectively.
When used in sed, the symbols [ ] ( ) also need to be escaped with a backslash \, resulting in the following script:
./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d'
/home/ukrishnan/projects/test.yml,LOG_DRIVER,syslog
/home/ukrishnan/projects/test.yml,IMAGE,mysql:5.6
/home/ukrishnan/projects/mysql/app.xml,ENV_ACCOUNT_BRIDGE_ENDPOINT,/u01/src/test/sample.txt
Now we put the lines in a loop, and for each line we extract filename, key, and value:
for line in $(./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d')
do
    IFS="," read -ra arr <<< $line
    filename=${arr[0]}
    key=${arr[1]}
    value=${arr[2]}
    cat <<EOF
filename : $filename
key : $key
value : $value
EOF
done
Which outputs:
filename : /home/ukrishnan/projects/test.yml
key : LOG_DRIVER
value : syslog
filename : /home/ukrishnan/projects/test.yml
key : IMAGE
value : mysql:5.6
filename : /home/ukrishnan/projects/mysql/app.xml
key : ENV_ACCOUNT_BRIDGE_ENDPOINT
value : /u01/src/test/sample.txt

Converting JSON to .csv

I've found some data that someone is downloading into a JSON file (I think! - I'm a newb!). The file contains data on nearly 600 football players.
Here's the file: https://raw.githubusercontent.com/llimllib/fantasypl_stats/f944410c21f90e7c5897cd60ecca5dc72b5ab619/data/players.1426687570.json
Is there a way I can grab some of the data and convert it to .csv? Specifically the 'Fixture History'?
Thanks in advance for any help :)
Here is a solution using jq.
If the file filter.jq contains
.[]
| {first_name, second_name, all: .fixture_history.all[]}
| [.first_name, .second_name, .all[]]
| @csv
and data.json contains the sample data then the command
jq -M -r -f filter.jq data.json
will produce the output below (only the first few rows are shown here):
"Wojciech","Szczesny","16 Aug 17:30",1,"CRY(H) 2-1",90,0,0,0,1,0,0,0,0,0,1,0,13,7,0,55,2
"Wojciech","Szczesny","23 Aug 17:30",2,"EVE(A) 2-2",90,0,0,0,2,0,0,0,0,0,0,0,5,9,-9306,55,1
"Wojciech","Szczesny","31 Aug 16:00",3,"LEI(A) 1-1",90,0,0,0,1,0,0,0,1,0,2,0,7,15,-20971,55,1
"Wojciech","Szczesny","13 Sep 12:45",4,"MCI(H) 2-2",90,0,0,0,2,0,0,0,0,0,6,0,12,17,-39686,55,3
"Wojciech","Szczesny","20 Sep 15:00",5,"AVL(A) 3-0",90,0,0,1,0,0,0,0,0,0,2,0,14,22,-15931,55,6
"Wojciech","Szczesny","27 Sep 17:30",6,"TOT(H) 1-1",90,0,0,0,1,0,0,0,0,0,4,0,10,13,-5389,55,3
"Wojciech","Szczesny","05 Oct 14:05",7,"CHE(A) 0-2",90,0,0,0,2,0,0,0,0,0,1,0,3,9,-8654,55,1
"Wojciech","Szczesny","18 Oct 15:00",8,"HUL(H) 2-2",90,0,0,0,2,0,0,0,0,0,2,0,7,9,-824,54,1
"Wojciech","Szczesny","25 Oct 15:00",9,"SUN(A) 2-0",90,0,0,1,0,0,0,0,0,0,3,0,16,22,-11582,54,7
JSON is a more detailed data format than CSV - it allows for more complex data structures. Inevitably if you do this, you 'lose detail'.
If you want to fetch it automatically - that's doable, but I've skipped it because 'doing' https URLs is slightly more complicated.
So assuming you've downloaded your file, here's a possible solution in Perl (You've already got one for Python - both are very powerful scripting languages, but can pretty much cover the same ground - so it's as much a matter of taste as to which you use).
#!/usr/bin/perl
use strict;
use warnings;
use JSON;

my $file = 'players.json';
open( my $input, "<", $file ) or die $!;
my $json_data = decode_json(
    do { local $/; <$input> }
);

foreach my $player_id ( keys %{$json_data} ) {
    foreach my $fixture (
        @{ $json_data->{$player_id}->{fixture_history}->{all} } )
    {
        print join( ",",
            $player_id, $json_data->{$player_id}->{web_name},
            @{$fixture}, "\n", );
    }
}
Hopefully you can see what's going on here: we read the file via $input and use decode_json to create a data structure.
This data structure is a nested hash (Perl's term for a key-value data structure).
So we extract the keys from this hash, which are the ID numbers right at the beginning of each entry.
Then we loop through each of them, extracting the fixture_history array. For each element in that array, we print the player ID, their web_name, and then the data from fixture_history.
This gives output like:
1,Szczesny,10 Feb 19:45,25,LEI(H) 2-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-2413,52,0,
1,Szczesny,21 Feb 15:00,26,CRY(A) 2-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-2805,52,0,
1,Szczesny,01 Mar 14:05,27,EVE(H) 2-0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1862,52,0,
1,Szczesny,04 Mar 19:45,28,QPR(A) 2-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1248,52,0,
1,Szczesny,14 Mar 15:00,29,WHU(H) 3-0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1897,52,0,
Does this make sense?
Python has some good libraries for doing this. If you copy the following code into a file and save it as fix_hist.py or something, then save your JSON file as file.json in the same directory, it will create a csv file with the fixture histories each saved as a row. Just run python fix_hist.py in your command prompt (or terminal for mac):
import csv
import json

json_data = open("file.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv", "wb+"))

for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        f.writerow(j)

json_data.close()
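One caveat: the "wb+" mode is Python 2 style; on Python 3 the csv module expects the output file opened in text mode with newline="". A minimal sketch of the same loop for Python 3:

import csv
import json

with open("file.json") as json_data:
    data = json.load(json_data)

# Python 3: text mode with newline="" instead of "wb+"
with open("fix_hists.csv", "w", newline="") as out:
    f = csv.writer(out)
    for i in data:
        for j in data[i]["fixture_history"]["all"]:
            f.writerow(j)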
To add additional data to the fixture history, you can add insert statements before writing the rows:
import csv
import json

json_data = open("file.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv", "wb+"))

for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        try:
            j.insert(0, str(data[i]["first_name"]))
        except:
            j.insert(0, 'error')
        try:
            j.insert(1, data[i]["web_name"])
        except:
            j.insert(1, 'error')
        try:
            f.writerow(j)
        except:
            f.writerow(['error', 'error'])

json_data.close()
With insert(), just indicate the position in the row you want the data point to occupy as the first argument.
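For example, with a toy row (names taken from the sample output above, just to show the positions):

row = [90, 0, 0]
row.insert(0, "Wojciech")    # row is now ["Wojciech", 90, 0, 0]
row.insert(1, "Szczesny")    # row is now ["Wojciech", "Szczesny", 90, 0, 0]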

JSON to fixed width file

I have to extract data from a JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed-width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pairs, I can extract it by processing each line of the JSON file, checking the type, and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting.
Sample JSON file would look like as below:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
    # strip braces and double quotes, then split on colons and commas
    line = ""
    gsub("[{}\x22]", "", $0)
    f = split($0, a, "[:,]")
    for (i = 1; i <= f; i++)
        if (a[i] == "Type")
            file = a[++i]
        else
            line = line sprintf("%-15s", a[i])
    print line > (file ".fixed.out")
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
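For reference, assuming the script is saved as tofixed.awk (the name is made up), a run over the sample data would go like this:

$ awk -f tofixed.awk input.json
$ ls *.fixed.out
Chat.fixed.out  Mail.fixed.out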
A Perl script:
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs';    # for FileCache
use FileCache;       # avoid exceeding the system's maximum number of file descriptors
use JSON;

my $type;
my $json = JSON->new->utf8(1);    # NOTE: expect utf-8 strings

while (my $line = <>) {    # for each input line
    # extract the type
    eval { $type = $json->decode($line)->{Type} };
    $type = 'json_decode_error' if $@;
    $type ||= 'missing_type';

    # print to the appropriate file
    my $fh = cacheout '>>', "$type.out";
    print $fh $line;    # NOTE: use cache if there are too many hdd seeks
}
And the corresponding shell script:
#!/bin/bash
# NOTE: bash is used to create non-ascii filenames correctly

__extract_type()
{
    perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}

__process_input()
{
    local IFS=$'\n'
    while read line; do    # for each input line
        # extract the type
        local type="$(__extract_type "$line" 2>/dev/null ||
                      echo json_decode_error)"
        [ -z "$type" ] && local type=missing_type

        # print to the appropriate file
        echo "$line" >> "$type.out"
    done
}

__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out