How to convert sas7bdat file to csv?

I want to convert a .sas7bdat file to a .csv/txt format so that I can upload it into a Hive table.
I'm receiving the .sas7bdat file from an outside server and do not have SAS on my machine.

Use one of R's foreign-data packages to read the file and then write it out as CSV from R. See p. 12 of the R Data Import/Export manual:
http://cran.r-project.org/doc/manuals/R-data.pdf
The example below uses the sas7bdat package instead; it appears to ignore custom formats and reads the underlying data.
In SAS:
proc format;
    value agegrp
        low - 12  = 'Pre Teen'
        13 - 15   = 'Teen'
        16 - high = 'Driver';
run;

libname test 'Z:\Consulting\SAS Programs';

data test.class;
    set sashelp.class;
    age2 = age;
    format age2 agegrp.;
run;
In R:
install.packages("sas7bdat")
library(sas7bdat)
x <- read.sas7bdat("class.sas7bdat", debug=TRUE)
x
# complete the conversion by writing the data out as CSV
write.csv(x, "class.csv", row.names=FALSE)

The Python package sas7bdat (available on PyPI) includes a library for reading sas7bdat files:
from sas7bdat import SAS7BDAT
with SAS7BDAT('foo.sas7bdat') as f:
    for row in f:
        print(row)
as well as a command-line program that requires no programming:
$ sas7bdat_to_csv in.sas7bdat out.csv
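If you'd rather produce the CSV from Python code instead, the same package can hand the data to pandas. A minimal sketch (the file names are placeholders and pandas must be installed):
from sas7bdat import SAS7BDAT

with SAS7BDAT('foo.sas7bdat') as f:
    df = f.to_data_frame()        # load the whole file into a pandas DataFrame
df.to_csv('foo.csv', index=False) # write it back out as CSV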

I recently wrote this package, which allows you to convert sas7bdat files to CSV using Hadoop/Spark. It is able to split a giant sas7bdat file, thus achieving high parallelism. The parsing also uses parso, as suggested by @Ashpreet.
https://github.com/saurfang/spark-sas7bdat
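For illustration, a minimal PySpark sketch of the conversion (the paths are placeholders, and it assumes the spark-sas7bdat package and its dependencies are on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sas7bdat-to-csv").getOrCreate()

# read the SAS file through the spark-sas7bdat data source...
df = spark.read.format("com.github.saurfang.sas.spark").load("/path/to/in.sas7bdat")

# ...and write it out as CSV, partitioned across the cluster
df.write.csv("/path/to/out_csv", header=True)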

If this is a one-off, you can download the SAS System Viewer for free from here (after registering for an account, which is also free):
http://support.sas.com/downloads/package.htm?pid=176
You can then open the SAS dataset in the viewer and save it as a CSV file. There is no command-line interface as far as I can tell, but if you really wanted one you could probably write an AutoHotkey script or similar to convert SAS datasets to CSV.
It is also possible to use the SAS Provider for OLE DB to read SAS datasets without actually having SAS installed; it is available here:
http://support.sas.com/downloads/browse.htm?fil=0&cat=64
However, this is rather complicated. Some documentation is available here if you want to get an idea:
http://support.sas.com/documentation/cdl/en/oledbpr/59558/PDF/default/oledbpr.pdf

Thanks for your help. I ended up using the parso utility in Java and it worked like a charm. The utility returns the rows as object arrays, which I wrote into a text file.
I referred to the utility from: http://lifescience.opensource.epam.com/parso.html

Related

Best data processing software to parse CSV file and make API call per row

I'm looking for ideas for an open-source ETL or data-processing tool that can monitor a folder for CSV files, then open and parse each CSV.
For each CSV row, the software should transform the row into JSON and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV, you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter, specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as JSON process data.
I wanted to limit the dependencies of this example, so the CSV parsing logic is kept very simple: commas inside field values may break it, and special characters may not be handled correctly. A more robust implementation could replace the simple Java String.split(",") with an existing CSV parser library such as OpenCSV.
The file watcher would indeed be a nice extension to the example. I may add it when I get around to it, but I would also accept a pull request in case you fork my project.
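To make the overall flow concrete, here is a rough Python sketch of the same idea, polling a folder instead of using a WatchService. The folder path, process key, and Camunda REST URL are all assumptions:
import csv
import os
import time

import requests

WATCH_DIR = "incoming"  # hypothetical folder to monitor
# hypothetical Camunda REST endpoint and process definition key
START_URL = "http://localhost:8080/engine-rest/process-definition/key/csvProcess/start"

def start_process(row):
    # wrap each CSV cell as a Camunda string variable and start one process instance
    variables = {k: {"value": v, "type": "String"} for k, v in row.items()}
    requests.post(START_URL, json={"variables": variables}).raise_for_status()

seen = set()
while True:
    for name in os.listdir(WATCH_DIR):
        if name.endswith(".csv") and name not in seen:
            seen.add(name)
            with open(os.path.join(WATCH_DIR, name), newline="") as f:
                # csv.DictReader handles quoting, unlike a naive split(",")
                for row in csv.DictReader(f):
                    start_process(row)
    time.sleep(5)  # simple polling in place of a file watcher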

ArangoDB: How to export collection to CSV?

I have noticed that the web interface of ArangoDB has a feature which allows users to download or upload data as a JSON file. However, I find nothing similar for CSV export. How can an existing ArangoDB collection be exported to a .csv file?
If you want to export data from ArangoDB to CSV, you should use arangoexport. It is included in the full packages as well as the client-only packages; you will find it next to the arangod server executable.
Basic usage is covered in the documentation:
https://docs.arangodb.com/3.4/Manual/Programs/Arangoexport/Examples.html#export-csv
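For example, something along these lines (the collection and field names are hypothetical):
$ arangoexport --type csv --collection mycollection --fields "_key,name,value" --output-directory export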
Also see the CSV example with an AQL query:
https://docs.arangodb.com/3.4/Manual/Programs/Arangoexport/Examples.html#export-via-aql-query
Using an AQL query for a CSV export allows you to transform the data if desired, e.g. to concatenate an array to a string or unpack nested objects. If you don't do that, then the JSON serialization of arrays/objects will be exported (which may or may not be what you want).
The default Arango install includes the following file:
/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js
It includes this comment:
// This is a generic CSV exporter for collections.
//
// Usage: Run with arangosh like this:
// arangosh --javascript.execute <CollName> [ <Field1> <Field2> ... ]
Unfortunately, at least in my experience, that usage tip is incorrect. Arango team, if you are reading this, please correct the file or correct my understanding.
Here's how I got it to work:
arangosh --javascript.execute "/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js" "<CollectionName>"
Please specify a password:
Then it sends the CSV data to stdout. (If you wish to send it to a file, you will have to deal with the password prompt in some way.)

SAS libname JSON engine -- Twitter API

I'd like to use the SAS libname JSON engine instead of PROC GROOVY to import the JSON file I get from the Twitter API. I am running SAS 9.4M4 on openSUSE Leap 42.3.
I followed Falko Schulz's description of how to access the Twitter API and everything worked out fine, up to the point at which I wanted to import the JSON file into SAS. So the last working piece of code is:
proc http method="get"
    out=res headerin=hdrin
    url="https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.%nrstr(&)count=1"
    ct="application/x-www-form-urlencoded;charset=UTF-8";
run;
which yields a JSON file referenced by the fileref res.
Falko Schulz uses PROC GROOVY. In SAS 9.4M4, however, there is this mysterious JSON libname engine that makes life easier. It works for simple JSON files, but not for the Twitter data. So, having downloaded the JSON data from Twitter, using
libname test JSON fileref=res;
gives me the following error:
Invalid JSON in input near line 1 column 751: Some code points did not
transcode.
I suspected that something was wrong with the encoding of the file, so I used a filename statement of the form:
filename res TEMP encoding="utf-8";
without luck...
I also tried to increase the record length
filename res TEMP encoding="utf-8" lrecl=1000000;
and played around with the record format... to no avail...
Can somebody help? What am I missing? How can I use the JSON engine in a LIBNAME statement without running into this error?
Run your SAS session in UTF-8 mode if you're reading UTF-8 files into SAS datasets. While it's possible to run SAS in another mode and still read UTF-8 encoded files to some extent, you will generally have a lot of difficulties.
You can tell what encoding your session is in with this code:
proc options option=encoding;
run;
If it returns this:
ENCODING=WLATIN1 Specifies the default character-set encoding for the SAS session.
Then you're not in UTF-8 encoding.
SAS 9.4 and later on the desktop are typically installed with the UTF-8 option automatically selected in addition to the default WLATIN1 (when installed in English, anyway). You can find it in the Start menu under SAS 9.4 (Unicode Support), or by using the sasv9.cfg file in the 9.4\nls\u8\ subfolder of your SAS Foundation folder. Earlier versions may also have that subfolder/language available, but it was not always installed by default.

2 Hdfs files comparison

I have 6000+ .csv files in /hadoop/hdfs/location1 and 6100+ .csv files in /hadoop/hdfs/location2.
I want to compare these two HDFS directories and find the differences between them. The differing .csv files (those that do not match) should be copied to a third HDFS directory (/hadoop/hdfs/location3). I am not sure whether the diff command can be used on the HDFS file system as it can in unix.
Any idea on how to resolve this would be appreciated.
Anshul
You could use a Python (or Perl, etc.) script to check this. Depending on your specific needs and the speed required, you could check file sizes first. Are the filenames identical? Are the creation dates identical, etc.?
If you want to use Python, check out the filecmp module:
>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst')
True
>>> filecmp.cmp('undoc.rst', 'index.rst')
False
Look at the post below, which provides an answer on how to compare two HDFS files. You will need to extend it to cover two folders:
HDFS File Comparison
You could easily do this with the Java API and create a small app:
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
return chksum1.equals(chksum2); // compare with equals(), not ==
There are no HDFS commands to compare files directly.
As the post below shows, this can be achieved by writing a Pig program or a MapReduce program:
Equivalent of linux 'diff' in Apache Pig
I think the steps below will solve your problem:
1. Get the list of file names in the first location into one file.
2. Get the list of file names in the second location into another file.
3. Find the diff between the two lists using unix commands.
4. Copy whatever diff files you found to the other location.
A rough sketch of these steps follows below. I hope this helps you; otherwise let me know.
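Here is that sketch in Python, assuming the hdfs command-line client is on the PATH. The checksum comparison also catches same-named files whose contents differ:
import subprocess

def list_names(path):
    # `hdfs dfs -ls -C` prints one full path per line
    out = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", path], text=True)
    return {line.rsplit("/", 1)[-1] for line in out.splitlines() if line.strip()}

def checksum(path):
    # output looks like "<path> <algorithm> <hex>"; keep only the digest
    return subprocess.check_output(["hdfs", "dfs", "-checksum", path], text=True).split()[-1]

loc1 = "/hadoop/hdfs/location1"
loc2 = "/hadoop/hdfs/location2"
loc3 = "/hadoop/hdfs/location3"

names1, names2 = list_names(loc1), list_names(loc2)

# files that exist in only one directory are diffs by definition
diffs = {(loc1, n) for n in names1 - names2} | {(loc2, n) for n in names2 - names1}

# files that exist in both directories are diffs when their checksums differ
for n in names1 & names2:
    if checksum(f"{loc1}/{n}") != checksum(f"{loc2}/{n}"):
        diffs.add((loc1, n))

# copy every diff file into the third directory
for src, n in diffs:
    subprocess.run(["hdfs", "dfs", "-cp", f"{src}/{n}", loc3], check=True)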

Migrating from Lighthouse to Jira - Problems Importing Data

I am trying to find the best way to import all of our Lighthouse data (which I exported as JSON) into JIRA, which wants a CSV file.
I have a main folder containing many subdirectories, JSON files and attachments; the total size is around 50MB. JIRA allows importing CSV data, so I was thinking of converting the JSON data to CSV, but all the converters I have seen online only handle a single file rather than recursively parsing an entire folder structure and creating the CSV equivalent that can then be imported into JIRA.
Does anybody have any experience of doing this, or any recommendations?
Thanks, Jon
The JIRA CSV importer assumes a denormalized view of each issue, with all the fields available on one line per issue. I think the quickest way would be to write a small Python script to read the JSON and emit the minimum CSV. That should get you issues and comments. Keep track of which Lighthouse ID corresponds to each new issue key. Then write another script to add things like attachments using the JIRA SOAP API. For JIRA 5.0, the REST API is a better choice.
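A minimal sketch of such a script, assuming each exported ticket lives in tickets/*/ticket.json; the "title", "body" and "state" field names are assumptions, not the documented Lighthouse schema:
import csv
import glob
import json

with open("jira_import.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Summary", "Description", "Status"])  # headers JIRA can map
    for path in glob.glob("lighthouse_export/tickets/*/ticket.json"):
        with open(path) as f:
            ticket = json.load(f)
        # "title", "body" and "state" are assumed field names
        writer.writerow([ticket.get("title", ""),
                         ticket.get("body", ""),
                         ticket.get("state", "")])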
We just went through a Lighthouse-to-JIRA migration and ran into this. The best approach is to have your script start at the top-level export directory and loop through each ticket.json file. You can then build a master CSV or JSON file containing all the tickets to import into JIRA.
In Ruby (which is what we used), it would look something like this:
require "json"

Dir.glob("path/to/lighthouse_export/tickets/*/ticket.json") do |ticket|
  data = JSON.parse(File.read(ticket))
  # access the ticket data here and append it to a CSV
end