What I am starting with is the postcode table from the Netherlands. I split it up into a couple of CSV files, containing for instance the city as subject, PartOf as predicate, and municipality as object. This gives you a file like this:
city,PartOf,municipality
Meppel,PartOf,Meppel
Nijeveen,PartOf,Meppel
Rogat,PartOf,Meppel
Now I would like to get this data into MarkLogic. I can import CSV files and I can import triples, but I can't figure out the combination.
I would suggest rewriting it slightly so it conforms to the N-Triples format, giving it the .nt extension, and then using MLCP to load it with -input_file_type rdf.
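For example, a minimal Python sketch of that conversion (the example.org IRIs and the file names are placeholders I'm assuming, not anything MarkLogic requires):
import csv
from urllib.parse import quote

# Read the city,PartOf,municipality CSV and write one N-Triples line per row.
with open('city_partof.csv', newline='') as src, open('city_partof.nt', 'w') as dst:
    reader = csv.reader(src)
    next(reader)  # skip the header row
    for subject, predicate, obj in reader:
        dst.write('<http://example.org/place/{}> '
                  '<http://example.org/def/{}> '
                  '<http://example.org/place/{}> .\n'.format(
                      quote(subject), quote(predicate), quote(obj)))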
HTH!
You can use Google Refine to convert CSV data to RDF. After that, MLCP can be used to push that data. You can do something like this -
$ mlcp.sh import -username user -password password -host localhost \
-port 8000 -input_file_path /my/data -mode local \
-input_file_type rdf
For more on loading triples using MLCP, you can refer to this MarkLogic Community Page.
I am extracting prosody features from an audio file using the Windows version of openSMILE. It runs successfully and an output CSV is generated, but when I open the CSV, it shows rows that are not readable. I used this command to extract the prosody features:
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv
And the output CSV looks like this:
[
I even tried the sample wave file in the example audio folder of the openSMILE directory, and the output is the same (not readable). Can someone help me identify where the problem actually is, and how I can fix it?
You need to enable the csvSink component in the configuration file to make it work. The file config\prosody\prosodyShs.conf that you are using does not have this component defined and always writes binary output.
You can verify that it is the standard binary output this way: omit the -O parameter from your command so it becomes SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav and execute it. You will get an output.htk file which is exactly the same as prosody_sample1.csv.
How to output a CSV? You can take a look at the example configuration in opensmile-3.0-win-x64\config\demo\demo1_energy.conf, where a csvSink component is defined.
You can find more information in the official documentation:
Get started page of the openSMILE documentation
The section on configuration files
Documentation for cCsvSink
This is how I solved the issue. First I added the csvSink component to the list of component instances:
instance[csvSink].type = cCsvSink
Next I added the configuration parameters for this instance.
[csvSink:cCsvSink]
reader.dmLevel = energy
filename = \cm[outputfile(O){output.csv}:file name of the output CSV file]
delimChar = ;
append = 0
timestamp = 1
number = 1
printHeader = 1
\{../shared/standard_data_output_lldonly.conf.inc}
Now if you run this file it will throw errors, because reader.dmLevel = energy depends on waveframes. So the final changes would be:
[energy:cEnergy]
reader.dmLevel = waveframes
writer.dmLevel = energy
[int:cIntensity]
reader.dmLevel = waveframes
[framer:cFramer]
reader.dmLevel=wave
writer.dmLevel=waveframes
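With these changes in place, rerunning the command from the question should now produce a readable CSV (this assumes you edited prosodyShs.conf in place rather than a copy):
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv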
Further reference on how to write openSMILE configuration files can be found here.
I've run a Spark job via databricks on AWS, and by calling
big_old_rdd.saveAsTextFile("path/to/my_file.json")
have saved the results of my job into an S3 bucket on AWS. The result of that spark command is a directory path/to/my_file.json containing portions of the result:
_SUCCESS
part-00000
part-00001
part-00002
and so on. I can copy those part files to my local machine using the AWS CLI with a relatively simple command:
aws s3 cp s3://my_bucket/path/to/my_file.json local_dir --recursive
and now I've got all those part-* files locally. Then I can get a single file with
cat $(ls part-*) > result.json
The problem is that this two-stage process is cumbersome and leaves file parts all over the place. I'd like to find a single command that will download and merge the files (ideally in order). When dealing with HDFS directly this is something like hadoop fs -cat "path/to/my_file.json/*" > result.json.
I've looked around through the AWS CLI documentation but haven't found an option to merge the file parts automatically, or to cat the files. I'd be interested in either some fancy tool in the AWS API or some bash magic that will combine the above commands.
Note: Saving the result into a single file via spark is not a viable option as this requires coalescing the data to a single partition during the job. Having multiple part files on AWS is fine, if not desirable. But when I download a local copy, I'd like to merge.
This can be done with a relatively simple function using boto3, the AWS python SDK.
The solution involves listing the part-* objects in a given key, and then downloading each of them and appending to a file object. First, to list the part files in path/to/my_file.json in the bucket my_bucket:
import boto3
bucket = boto3.resource('s3').Bucket('my_bucket')
keys = [obj.key for obj in bucket.objects.filter(Prefix='path/to/my_file.json/part-')]
Then, use Bucket.download_fileobj() with a file opened in append mode to write each of the parts. The function I'm now using, with a few other bells and whistles, is:
from os.path import basename
import boto3
def download_parts(base_object, bucket_name, output_name=None, limit_parts=0):
    """Download all file parts into a single local file"""
    base_object = base_object.rstrip('/')
    bucket = boto3.resource('s3').Bucket(bucket_name)
    prefix = '{}/part-'.format(base_object)
    output_name = output_name or basename(base_object)
    with open(output_name, 'ab') as outfile:
        for i, obj in enumerate(bucket.objects.filter(Prefix=prefix)):
            bucket.download_fileobj(obj.key, outfile)
            if limit_parts and i >= limit_parts:
                print('Terminating download after {} parts.'.format(i))
                break
        else:
            print('Download completed after {} parts.'.format(i))
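A hypothetical call, using the bucket and prefix from the question:
download_parts('path/to/my_file.json', 'my_bucket', output_name='result.json')
Note that the output file is opened in append mode ('ab'), so remove any existing result.json before re-running.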
The downloading part may be an extra line of code, but as far as cat'ing the parts in order, you can do it by time created or alphabetically:
Combined in order of time created: cat $(ls -tr part-*) > outputfile
Combined & Sorted alphabetically: cat $(ls part-* | sort) > outputfile
Combined & Sorted reverse-alphabetically: cat $(ls part-* | sort -r) > outputfile
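If the main goal is just to avoid the two manual steps, one option (assuming the same bucket, prefix, and AWS CLI setup as in the question) is to chain the question's own commands into one line:
aws s3 cp s3://my_bucket/path/to/my_file.json ./parts --recursive && cat ./parts/part-* > result.json && rm -r ./parts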
I performed a GWAS in PLINK and now I would like to look at the data for a small set of SNPs, listed one per line in a file called snps.txt.
I would like to export the data from PLINK for these specific SNPs into a .txt or .csv file. Ideally, this file would have the individual IDs as well as the genotypes for these SNPs so that I could later merge it with my phenotype file and perform additional analyses and plots.
Is there an easy way to do that? I know I can use --extract to request specific SNPs only but I can't find a way to tell PLINK to export the data to an "exportable" text-based format.
If you are using classic plink (1.07) you should consider upgrading to plink 1.9. It is a lot faster, and supports many more formats. This answer is for plink 1.9.
Turning binary plink data into a .csv file
It sounds like your problem is that you are unable to turn the binary data into a regular plink text file.
This is easy to do with the recode option. It should be used without any parameters to convert to the plink text format:
plink --bfile gwas_file --recode --extract snps.txt --out gwas_file_text
If you want to convert the .ped data to a csv afterwards you could do the following:
cut -d " " -f2-2,7- --output-delimiter=, gwas_file_text.ped > gwas_file_text.csv
This produces a comma-delimited file with IDs in the first column and then genotypes.
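For illustration, a hypothetical line of that output (individual ID and alleles are made up; each genotype appears as two comma-separated alleles) might look like:
IND001,A,A,G,T,C,C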
Turning plink data into other text based file formats
Note that you can also convert the data to a lot of other text-based filetypes, all described in the docs.
One of these is the common variant call format (VCF), which puts the SNPs and individual IDs together in one file, as requested:
plink --bfile gwas_file --recode vcf --extract snps.txt --out gwas_file_text
I have working code for parsing JSON output in KornShell by treating it as a string of characters. The issue is that the vendor keeps changing the position of the field that I am interested in. I understand that in JSON we can parse by key-value pairs.
Is there something out there that can do this? I am interested in a specific field, and I would like to use it to run checks on the status of another REST API call.
My sample json output is like this:
JSONDATA value :
{
"status": "success",
"job-execution-id": 396805,
"job-execution-user": "flexapp",
"job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is that it relies on fields delimited by :, and the field position has been known to change between vendor releases.
So I am trying to see if I can use a utility out there that would always give me the key-value pair of "job-execution-id": 396805, no matter where it is in the json output.
I started looking at jsawk, but it requires the js interpreter to be installed on our machines, which I don't want. Any hint on how to go about finding which RPM I need to solve this?
I am using RHEL5.5.
Any help is greatly appreciated.
The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.
Python is included by default with CentOS so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
# get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and the straight "import json" statement won't work unless you have Python 2.6+.
try:
    import json
except ImportError:
    import simplejson as json
import sys

json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
KornShell script that executes it:
#!/bin/ksh
# get_job_execution_id.sh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
EXECUTION_ID=`python get_job_execution_id.py "$JSON_DATA"`
echo $EXECUTION_ID
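To slot this back into the original script, the assignment from the question could then become (keeping the DATA variable from the question and assuming the Python script sits alongside it):
RUNJOB=$(python get_job_execution_id.py "${DATA}")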
I'm trying to export data from HDFS to Couchbase and I have a problem with my file format.
My configuration:
Couchbase Server 2.0
Hadoop stack CDH 4.1.2
Sqoop 1.4.2 (compiled with Hadoop 2.0.0)
Couchbase/Hadoop connector (compiled with Hadoop 2.0.0)
When I run the export command, I can easily export files with this kind of format:
id,"value"
or
id,42
or
id,{"key":"value"}
But when the JSON object itself contains a comma, it doesn't work:
id,{"key1":"value1","key2":"value2"}
The content is truncated at the first comma and displayed in base64 by Couchbase, because the content is no longer valid JSON...
So, my question is: how must the file be formatted to be stored as a JSON document?
Can we only export a key/value file?
I want to export JSON files from HDFS the way cbdocloader does with files from the local file system...
I'm afraid this is expected behavior, as Sqoop is parsing your input file as CSV with a comma as the separator. You might need to tweak your file format to either escape the separator or enclose the entire JSON string. I would recommend reading how exactly Sqoop deals with escaping separators and enclosing strings in the user guide [1].
Links:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#id387098
I think your best bet is to convert the files to tab-delimited, if you're still working on this. If you look at the Sqoop documentation (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_large_objects), there's an option --fields-terminated-by which allows you to specify which characters Sqoop splits fields on.
If you passed it --fields-terminated-by '\t', and a tab-delimited file, it would leave the commas in place in your JSON.
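For example, if the HDFS export file were tab-delimited (the separator below is a literal tab between the key and the JSON), the commas inside the JSON would no longer be treated as field separators:
id1	{"key1":"value1","key2":"value2"}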
@mpiffaretti, can you post your sqoop export command? I think each JSON object should have its own key value.
key1 {"dataOne":"ValueOne"}
key2 {"dataTwo":"ValueTwo"}
http://ajanacs.weebly.com/blog
In your case, changing the data like below may help you solve the issue.
id,{"key":"value"}
id2,{"key2":"value2"}
Let me know if you have further questions on it.