SAS libname JSON engine -- Twitter API (JSON)

I'd like to use the SAS libname JSON engine instead of PROC GROOVY to import the JSON file I get from the Twitter API. I am running SAS 9.4M4 on OpenSuse LEAP 42.3.
I followed Falko Schulz's description of how to access the Twitter API, and everything worked fine up to the point at which I wanted to import the JSON file into SAS. So the last working line of code is:
proc http method="get"
out=res headerin=hdrin
url="https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.%nrstr(&)count=1"
ct="application/x-www-form-urlencoded;charset=UTF-8";
run;
which writes the JSON response to the file referenced by the fileref "res".
Falko Schulz uses PROC GROOVY. In SAS 9.4M4, however, there is this mysterious JSON libname engine that makes life easier. It works for simple JSON files, but not for the Twitter data. So, with the JSON data from Twitter downloaded, using
libname test JSON fileref=res;
gives me the following error:
Invalid JSON in input near line 1 column 751: Some code points did not
transcode.
I suspected that something was wrong with the encoding of the file, so I used a filename statement of the form:
filename res TEMP encoding="utf-8";
without luck...
I also tried to increase the record length
filename res TEMP encoding="utf-8" lrecl=1000000;
and played around with the record format... to no avail...
Can somebody help? What am I missing? How can I use the JSON engine in a LIBNAME statement without running into this error?

Run your SAS session in UTF-8 mode if you're reading UTF-8 files into SAS datasets. While it's possible to run SAS in another mode and still read UTF-8 encoded files to some extent, you will generally run into a lot of difficulties.
You can tell what encoding your session is in with this code:
proc options option=encoding;
run;
If it returns this:
ENCODING=WLATIN1 Specifies the default character-set encoding for the SAS session.
Then you're not in UTF-8 encoding.
SAS 9.4 and later on the desktop are typically installed with the UTF-8 option selected automatically in addition to the default WLATIN1 (when installed in English, anyway). You can find it in the Start menu under "SAS 9.4 (Unicode Support)", or start SAS with the sasv9.cfg file in the 9.4\nls\u8\ subfolder of your SAS Foundation folder. Earlier versions may also have that subfolder/language installed, but it was not always installed by default.

Related

Malformed records are detected in schema inference when parsing JSON

I have a really frustrating error trying to parse basic JSON read from Blob Storage using a dataset within ADF.
My JSON is below:
[{"Bid":0.197514880839,"BaseCurrency":"AED"}
,{"Bid":0.535403560434,"BaseCurrency":"AUD"}
,{"Bid":0.351998712241,"BaseCurrency":"BBD"}
,{"Bid":0.573128306234,"BaseCurrency":"CAD"}
,{"Bid":0.787556605631,"BaseCurrency":"CHF"}
,{"Bid":0.0009212964,"BaseCurrency":"CLP"}
,{"Bid":0.115389497248,"BaseCurrency":"DKK"}
]
I have tried all three JSON source settings and every one of them gives the error:
Malformed records are detected in schema inference. Parse Mode: FAILFAST
The three settings being:
Single Document
Array Of Documents
Document Per Line
Can anyone help? I simply need this to be a list of objects, that's it!
Paul
It should work with the JSON setting "Array of documents".
We face this issue when the JSON file is encoded as UTF-8 with a BOM; the ADF Data Flow is unable to parse such files. If you specify the encoding as UTF-8 without BOM when creating the file, it will work.
In my case, I am using a Copy activity to merge and create the JSON file and have specified the encoding as UTF-8 without BOM, and that resolved my issue.
Note: For some reason, we can't use a dataset with "UTF-8 without BOM" encoding in a Data Flow. In that case, you can create two datasets: one with the default UTF-8 encoding (to be used in the Data Flow) and one with UTF-8 without BOM (to be used in the Copy activity sink when creating the file).
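If the file already exists with a BOM and you can pre-process it before the Data Flow runs (in a Notebook step, for instance), a minimal Python sketch like the following would strip it; the file name is just a placeholder:
# Strip a UTF-8 BOM from an existing JSON file, rewriting it in place.
# "rates.json" is a placeholder path.
with open("rates.json", "rb") as f:
    raw = f.read()
text = raw.decode("utf-8-sig")   # utf-8-sig drops a leading BOM if one is present
with open("rates.json", "w", encoding="utf-8") as f:
    f.write(text)                # written back as UTF-8 without a BOM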
Thank you.

Check the CSV file encoding in Data Factory

I am implementing a pipeline to move CSV files from one folder to another in a data lake, with the condition that the CSV file is encoded in UTF-8.
Is it possible to check the encoding of a CSV file directly in Data Factory/Data Flow?
Actually, the encoding is set in the connection settings of the dataset. What happens in this case if the encoding of the CSV file is different?
What happens at the database level if the CSV file is staged with the wrong encoding?
Thank you in advance.
For now, we can't check the file encoding in Data Factory/Data Flow directly. We must pre-set the encoding type to read/write the files:
Ref: https://learn.microsoft.com/en-us/azure/data-factory/format-delimited-text#dataset-properties
The Data Factory default file encoding is UTF-8.
As #wBob said, you need to do the encoding check at the code level, for example in an Azure Function or a Notebook, and call these activities from the pipeline.
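As an illustration, a minimal Python check that could run in a Notebook or an Azure Function might look like the sketch below; the file path is a placeholder, and how you surface the result to the pipeline is up to you:
# Report whether a file is valid UTF-8; "staged.csv" is a placeholder path.
def is_utf8(path):
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")     # raises UnicodeDecodeError if the bytes are not valid UTF-8
        return True
    except UnicodeDecodeError:
        return False

print(is_utf8("staged.csv"))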
HTH.

Freemarker CSV generation - CSV with Chinese text truncates the CSV contents

I have this very weird problem. I'm using Java 8, Struts2 and Freemarker 2.3.23 to generate reports in CSV and HTML file formats (via .csv.ftl and .html.ftl templates, both saved in UTF-8 encoding), with data coming from a Postgres database.
The data has Chinese characters in it, and when I generate the report in HTML format it is fine and complete, and the Chinese characters are displayed properly. But when the report is generated in CSV, I have observed that:
If I run the app with the -Dfile.encoding=UTF-8 VM option, the Chinese characters are generated properly but the report is incomplete (i.e. the text is truncated, specifically near the end)
If I run the app without the -Dfile.encoding=UTF-8 VM option, the Chinese characters are displayed as question marks (?????) but the report is complete
Also, the app uses a StringWriter to write the data from the CSV and HTML templates.
So, what could be the problem? Am I hitting Java character limits? I do not see error in the logs either. Appreciate your help. Thanks in advance.
UPDATE:
The StringWriter returns the data in whole; it is when writing the data to the OutputStream that some of the data gets lost.
ANOTHER UPDATE:
Looks like the issue is with contentLength (because the app is a webapp and the CSV is generated as a file download), which was being computed from the data as a String using String.length(). The String.length() value comes out smaller than it should. Maybe it has something to do with the Chinese characters; that's why the length is being reported as too small.
I was able to resolve the contentLength issue by using String.getBytes("UTF-8").length instead.
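That fix makes sense: the content length needs to be the number of encoded bytes, while String.length() counts characters (UTF-16 code units), and each Chinese character takes three bytes in UTF-8. A quick Python illustration of the same principle (the sample string is arbitrary):
# Character count vs. UTF-8 byte count for a string containing Chinese characters.
s = "报告"                          # two Chinese characters
print(len(s))                       # 2  (characters)
print(len(s.encode("utf-8")))       # 6  (bytes: 3 bytes per character in UTF-8)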

Trying to load a UTF-8 CSV file with a flat file source in SSIS, keep getting errors saying it is an ANSI file format

I have a SSIS data flow task that reads from a CSV file and stores the results in a table.
I am simply loading the CSV file by rows (not even separating the columns) and dumping the entire row to the database, a very simple process.
The file contains UTF-8 characters, and the file also has the UTF-8 BOM, as I have verified.
Now when I load the file using a flat file connection, I have the following settings currently:
Unicode checked
Advanced editor shows the column as "Unicode text stream DT_NTEXT".
When I run the package, I get this error:
[Flat File Source [16]] Error: The data type for "Flat File
Source.Outputs[Flat File Source Output].Columns[DataRow]" is DT_NTEXT,
which is not supported with ANSI files. Use DT_TEXT instead and
convert the data to DT_NTEXT using the data conversion component.
[Flat File Source [16]] Error: Unable to retrieve column information
from the flat file connection manager.
It is telling me to use DT_TEXT, but my file is UTF-8 and it will lose its encoding, right? Makes no sense to me.
I have also tried with the Unicode checkbox unchecked and the codepage set to "65001 UTF-8", but I still get an error like the above.
Why does it say my file is an ANSI file?
I have opened my file in sublime text and saved it as UTF-8 with BOM. My preview of the flat file does show other languages correctly like Chinese and English combined.
When I didn't check Unicode, I would also get an error saying the flat file's error output column is DT_TEXT, and when I try to change it to Unicode text stream it gives me a popup error and doesn't allow me to do it.
I have faced this same issue for years, and to me it seems like it could be a bug with the Flat File Connection provider in SQL Server Integration Services (SSIS). I don't have a direct answer to your question, but I do have a workaround. Before I load data, I convert all UTF-8 encoded text files to UTF-16LE (Little Endian). It's a hassle, and the files take up about twice the amount of space uncompressed, but when it comes to loading Unicode into MS-SQL, UTF-16LE just works!
As regards the actual conversion step, it is for you to decide what works best in your workflow. When I have just a few files I convert them one by one in a text editor, but when I have a lot of files I use PowerShell. For example:
Powershell -c "Get-Content -Encoding UTF8 'C:\Source.csv' | Set-Content -Encoding Unicode 'C:\UTF16\Source.csv'"
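If you prefer to script the batch case outside PowerShell, a rough Python equivalent is sketched below; the folder paths are placeholders, and the "utf-16" codec writes a BOM in the platform's native byte order (little-endian on typical Windows machines):
# Convert every CSV in a folder from UTF-8 to UTF-16 (little-endian with BOM on Windows).
# The source and destination folders are placeholder paths.
import pathlib

src_dir = pathlib.Path(r"C:\Source")
dst_dir = pathlib.Path(r"C:\UTF16")
dst_dir.mkdir(exist_ok=True)

for src in src_dir.glob("*.csv"):
    text = src.read_text(encoding="utf-8-sig")               # tolerates an existing UTF-8 BOM
    (dst_dir / src.name).write_text(text, encoding="utf-16")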

How to convert a sas7bdat file to CSV?

I want to convert a .sas7bdat file to a .csv/txt format so that I can upload it into a hive table.
I'm receiving the .sas7bdat file from an outside server and do not have SAS on my machine.
Use one of the R foreign packages to read the file and then convert to CSV with that tool.
http://cran.r-project.org/doc/manuals/R-data.pdf
Pg 12
Use the sas7bdat package instead. It appears to ignore custom formats and reads the underlying data.
In SAS:
proc format;
value agegrp
low - 12 = 'Pre Teen'
13 -15 = 'Teen'
16 - high = 'Driver';
run;
libname test 'Z:\Consulting\SAS Programs';
data test.class;
set sashelp.class;
age2=age;
format age2 agegrp.;
run;
In R:
install.packages("sas7bdat")
library(sas7bdat)
x <- read.sas7bdat("class.sas7bdat", debug=TRUE)
x
The Python package sas7bdat includes a library for reading sas7bdat files:
from sas7bdat import SAS7BDAT
with SAS7BDAT('foo.sas7bdat') as f:
    for row in f:
        print(row)
and a command-line program that requires no programming:
$ sas7bdat_to_csv in.sas7bdat out.csv
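If you would rather stay in Python than shell out to the command-line tool, a minimal sketch along the same lines could write the rows straight to a CSV file; 'foo.sas7bdat' and 'foo.csv' are placeholder names:
# Write every row the sas7bdat reader yields into a CSV file.
import csv
from sas7bdat import SAS7BDAT

with SAS7BDAT('foo.sas7bdat') as f, open('foo.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in f:
        writer.writerow(row)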
I recently wrote this package, which allows you to convert sas7bdat to CSV using Hadoop/Spark. It's able to split a giant sas7bdat file, thus achieving high parallelism. The parsing also uses parso, as suggested by #Ashpreet:
https://github.com/saurfang/spark-sas7bdat
If this is a one-off, you can download the SAS system viewer for free from here (after registering for an account, which is also free):
http://support.sas.com/downloads/package.htm?pid=176
You can then open the SAS dataset in the viewer and save it as a CSV file. There is no CLI as far as I can tell, but if you really wanted to, you could probably write an AutoHotkey script or something similar to convert SAS datasets to CSV.
It is also possible to use the SAS provider for OLE DB to read SAS datasets without actually having SAS installed, and that's available here:
http://support.sas.com/downloads/browse.htm?fil=0&cat=64
However, this is rather complicated - some documentation is available here if you want to get an idea:
http://support.sas.com/documentation/cdl/en/oledbpr/59558/PDF/default/oledbpr.pdf
Thanks for your help. I ended up using the parso utility in Java and it worked like a charm. The utility returns the rows as object arrays, which I wrote into a text file.
I referred to the utility from: http://lifescience.opensource.epam.com/parso.html