JSON file without line breaks, can't import file to SAS - json

I have a large JSON file (250 MB) that shows no line breaks when I open it in Notepad or SAS. But if I open it in WordPad, I get the correct line breaks. From what I have read, this could mean the file uses Unix line breaks, which Notepad can't read but WordPad can.
I need to import the file into SAS. One way of doing this might be to open the file in WordPad and save it as a text file, which will hopefully retain the correct line breaks, so that I can read the file in SAS. I have tried reading the file as it is, but without line breaks I only get the first observation, and I can't get the program to find the next one.
I have tried getting WordPad to save the file, but WordPad crashes each time, probably because of the file size. I also tried doing this through PowerShell, but I can't figure out how to save the file once it is opened, and I see no reason why it should work when WordPad crashes if I try it through point and click.
Is there another way to fix this JSON file? Is there a way to view the Unix code for line breaks and replace it with Windows line breaks, or something to that effect?
EDIT:
I have tried adding the TERMSTR=LF option both in filename and infile, without any luck:
filename test "C:\path";
data datatest;
infile test lrecl = 32000 truncover scanover TERMSTR=LF;
input @'"Id":' ID $9.;
run;
However, if I manually edit a small portion of the file to have line breaks, it works. The TERMSTR option doesn't seem to do much for me.
EDIT 2:
Solved using RECFM=F
data datatest;
infile test lrecl = 42000 truncover scanover RECFM=F ;
input @'"Id":' ID $9.;
run;
EDIT 3:
It turns out this didn't solve the problem after all. RECFM=F means all records have a fixed length, which mine don't, so my data gets mixed up and a lot of info is skipped. I tried RECFM=V (variable), but that is not working either.

I guess you're using Windows, so try:
TYPE input_filename | MORE /P > output_filename
This should convert a Unix-style text file into a Windows/DOS-style one.
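If piping through MORE is awkward for a file this size, the same conversion can be sketched in Python. This is only a sketch under assumptions: it presumes the file really does contain bare LF line endings, and the function name `lf_to_crlf` and the chunk size are my own choices, not anything from the question.

```python
def lf_to_crlf(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, turning bare LF into CRLF.

    The file is processed as bytes, one chunk at a time, so a 250 MB
    file is never held in memory. Existing CRLF pairs stay as they are,
    even when the CR and the LF land in different chunks.
    """
    carry = b""  # a trailing CR held back until the next chunk is seen
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            if data.endswith(b"\r"):
                data, carry = data[:-1], b"\r"
            else:
                carry = b""
            # Collapse CRLF to LF first so nothing is converted twice.
            dst.write(data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n"))
        dst.write(carry)
```

Calling `lf_to_crlf("input_filename", "output_filename")` writes the converted copy and leaves the original untouched.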

250 Mbytes is not too long to treat as a single record.
data want ;
infile json lrecl=250000000; *250 Mb ;
input @'"Id":' ID :$9. @@;
run;
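Because the file is JSON, the "one big record" idea can also be tried outside SAS with a JSON parser, which does not care about line breaks at all. This is a different technique from the SAS step above, shown only as a sketch: it assumes the file is valid UTF-8 JSON that fits in memory, the key name "Id" is taken from the question, and `collect_ids` and `ids_from_file` are hypothetical helper names.

```python
import json


def collect_ids(node, key="Id"):
    """Recursively gather every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            else:
                found.extend(collect_ids(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_ids(item, key))
    return found


def ids_from_file(path, key="Id"):
    """Parse the whole file at once and return all values under `key`."""
    with open(path, encoding="utf-8") as handle:
        return collect_ids(json.load(handle), key)
```

The resulting list could then be written out one value per line, which SAS can read without any special record options.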

Related

Invalid literal because symbol appears when reading a csv file

When I am using Replit, I can remove the little symbol that appears when I drag and drop in a CSV file so that my main.py can read it; otherwise I get an "invalid literal for int() with base 10" error. I am now trying to run this on my local machine with Sublime Text and getting the same error as it reads the file from the directory, so I assume the symbol is being added before the file is read. I can click on the CSV file in Replit and edit it, but I cannot do this in Sublime.
Can someone explain what this symbol is for? How can I get it to read the basic comma-delimited numbers in the file? (It is a game tile map.)
import csv
with open(f'level{level}_data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
I saved it as a plain comma-delimited CSV instead of a UTF-8 comma-delimited CSV. It then imports without the 'question mark in a diamond' symbol. I understand this is an unrecognised special character, but I have nothing apart from integers in my table. Maybe someone could clarify that?
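A likely explanation, offered as an assumption since the file itself is not shown: the "UTF-8 comma delimited" export writes a byte order mark (BOM, U+FEFF) at the very start of the file. That invisible character ends up glued to the first number, so `int()` rejects it. Python's `utf-8-sig` codec strips a leading BOM when there is one. A minimal sketch, where `load_tile_map` is a hypothetical helper name:

```python
import csv


def load_tile_map(path):
    """Read a comma-delimited grid of integers, ignoring a leading UTF-8 BOM."""
    # "utf-8-sig" drops the BOM when it is present and behaves like
    # plain UTF-8 when it is not, so both kinds of export load.
    with open(path, newline="", encoding="utf-8-sig") as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        return [[int(cell) for cell in row] for row in reader]
```

With this, the file would not need to be re-saved in a different format before the game reads it.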

Premature end of line Weka error

I'm new to Weka and I have to use it for a University project. So, I created a .csv file and when I try to upload it to Weka, it says: "not recognised as a CSV data file. Reason: 1 problem encountered on line 2".
Then, if I open the .csv file with Notepad and then save as .arff file, when I try to open it again with Weka, in this case I get another error message: "not recognised as an arff data file. Reason: premature end of line, read Token[EOL], line 8".
Please help, I don't know much about working with Weka and really don't know what could be the problem, even though I did a lot of research about this problem.
This is the file: https://app.box.com/s/adfpf1zatgpl5mo20u5hdd1gnqihnq40
@Relation "PIB_Rata inflatiei"
@Attribute "PIB" NUMERIC
@Attribute "Rata_inflatiei" NUMERIC
@Data
30624.3,20780.9,27980.4,31920.3,37657.0,37168.3,35838.9,41978.0,36183.4,37439.0,40717.1,46174.0,59867.6,76217.6,99699.2,123533.7,171540.2,208185.1,167421.6,167998.1,185362.3,171664.6,191548.1,199325.9,177956.0
128.0,211.2,255.2,136.8,32.2,38.8,154.8,59.1,45.8,45.7,34.5,22.5,15.3,11.3,9.0,6.6,4.8,7.8,5.6,6.1,5.8,3.3,4.0,1.1,-0.6
In the ARFF format (as well as CSV) instances are rows, and attributes are columns.
Your file thus has too many columns; every row must have exactly two values.
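One way to get from the layout above to one instance per row is to transpose the two long lines. A minimal sketch, assuming each data line holds all the values of one attribute in the same order; `transpose_rows` is a hypothetical helper name, and the sample values are the first three from each line in the question.

```python
def transpose_rows(lines):
    """Turn one-line-per-attribute data into one-line-per-instance data."""
    columns = [line.strip().split(",") for line in lines if line.strip()]
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"attribute lines differ in length: {sorted(lengths)}")
    # zip(*columns) pairs the i-th value of every attribute.
    return [",".join(values) for values in zip(*columns)]


pib = "30624.3,20780.9,27980.4"
inflation = "128.0,211.2,255.2"
for row in transpose_rows([pib, inflation]):
    print(row)
```

Each printed row then carries one PIB value and one Rata_inflatiei value, which is the shape the data section needs.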

Not able to load CSV file in weka

I am not able to load a CSV file in Weka. I have removed each and every special symbol using a text editor, but still no luck. I am attaching the file; I would be obliged if someone could solve this problem.
It shows "Wrong number of values, Read 31, expected 27, read token[EOL], line 3"
link : https://drive.google.com/open?id=0By7zyIPDD6HJMmthWnZLSUk5aFE
You have plenty of empty fields in your file, and if you download it as .csv even the header gets three commas at its end.
e.g. your 6th line:
,Doug Walker,,,131,,Rob Walker,131,,Documentary,Doug Walker,Star Wars: Episode VII The Force Awakens  ,8,143,,0,,,,,,,,,12,7.1,,0,,,
Similar to the suggestion in this post, you could try something like Notepad++ or another text editor to replace ",," with ",?," to fill the gaps.
Convert NA values to ? automatically while loading
I did this, and then the first row gets two question marks as column names, which obviously doesn't work, so change the first row to look like this:
color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,?,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,additionalColName1,additionalColName2,additionalColName3
If you now try to import your data, Weka starts telling you which lines it doesn't like and why. By the way, you did not "remove each and every special symbol"!
After removing a few lines containing, for example, the Ç character, it worked.
That's just an ugly workaround. Try filling the empty values, and find a regular expression or a better way of saving your file to remove the last three commas from every line; I was just too lazy for now. But I could load it into Weka, and that's what you wanted (:
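The manual search-and-replace can also be scripted. The following is a sketch rather than a verified fix for this particular file: it assumes the surplus columns are the unnamed trailing ones in the header, and `fill_missing` is a hypothetical helper name. It trims every row to the named header columns and writes `?` into empty cells; it does not deal with an unnamed column in the middle of the header or with special characters.

```python
import csv
import io


def fill_missing(text, marker="?"):
    """Return CSV text with trailing unnamed columns dropped and blanks filled."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]
    # Drop trailing header cells that are empty (the extra commas).
    while header and not header[-1].strip():
        header.pop()
    width = len(header)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        cells = (row + [""] * width)[:width]  # pad or trim to header width
        writer.writerow([cell if cell.strip() else marker for cell in cells])
    return out.getvalue()
```

Note that trimming to the header width silently discards any values beyond the last named column, so it is worth checking a few rows by eye afterwards.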

Remove all binary characters from a file

Occasionally, I have a hard time manipulating data in a CSV file because of the following error.
Binary file (standard input) matches
I researched several articles online but cannot seem to find one that helps me remove all of the binary characters or elements from a CSV file.
Unfortunately, I do not know where to start with this.
If I run the 'file' command on the file, I get the following output:
Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
The second from last line in the file prints as:
"???? ?????, ???? ???",????,"?????, ????",???,,,,,,,,,,,,,,,,,,,,,,,,* Home,email@address.com,,
The second line in the file prints as:
,,,,,,,,,,,,,,,,,,,,,,,,,,,* ,email@address.com,,
This file contains too many lines to open in Excel or another GUI and use "Save as..." to remove the binary elements that way.
Please help me. Thank you!
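The `file` output above suggests the "binary" bytes are simply UTF-16 encoding: every ASCII character is stored with an accompanying NUL byte, which is typically what makes grep report a binary match. Re-encoding the file as UTF-8 is one way around that. A minimal sketch, assuming the file really is UTF-16 with a byte order mark as `file` reports; `utf16_to_utf8` is a hypothetical helper name.

```python
def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8, normalising line endings to LF."""
    # Reading in text mode with the default newline handling translates
    # both CRLF and lone CR to "\n"; newline="\n" on the output side
    # writes them back as plain LF on every platform.
    with open(src_path, encoding="utf-16") as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            dst.write(line)
```

Characters that print as `?` are left as whatever they are in the source; this step only changes the encoding and the line terminators.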

CSV to SAS dataset: no line-final comma causes problems

I'm trying to import a .CSV file into a SAS dataset, and am having some trouble. Here are two lines of sample input:
Foo,5,10,3.5
Bar,2,3,1.0
The problem I'm having is that the line-final "3.5" and "1.0" are not being correctly interpreted as variable values (instead SAS complains that they are invalid values, giving me a NOTE: Invalid data for VARIABLE error). However, when I add a comma to the end of the line, like so:
Foo,5,10,3.5,
Bar,2,3,1.0,
Then everything works fine. Is there a way that I can make this import work without modifying the source file?
Currently, my DATA step's INFILE statement has the DSD, DLM=',', and MISSOVER options.
With this data in a .csv file in a Windows environment:
Foo,5,10,1.5
Bar,2,3,2.1
Foo,5,10,3.5
Bar,2,3,4.1
This code works (running SAS locally on a Windows machine):
filename f 'D:\Data\SAS\input.csv';
data input;
infile f delimiter=',';
input char1 $ num1 num2 num3;
Run;
As @itzy mentioned, the environment is important. More info will help with the solution.
When you are working with data from a different environment, you can use the TERMSTR option on the INFILE statement to tell SAS how the lines of data are terminated.
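This most likely has to do with the different codes for line endings in Unix and Windows. I'm guessing your data comes from a different operating system than the one you're running SAS on. If the file ends each line with CR+LF and SAS only treats LF as the terminator, the leftover carriage return is read as part of the last field, which would explain why "3.5" is rejected while "3.5," works.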
The solution is to change the newline codes to the correct operating system. If you're running SAS on a unix system, try the dos2unix command. If you're running Windows, you can edit the CSV file with a text editor like UltraEdit or Notepad++ and save the file in Windows format.
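Before picking a TERMSTR value or converting the file, it can help to check which line terminators the file actually contains. A minimal sketch in Python; `count_line_endings` is a hypothetical helper name, and it reads the whole file into memory, which should be fine for an ordinary CSV file.

```python
def count_line_endings(path):
    """Count CRLF, bare LF and bare CR terminators in a file."""
    with open(path, "rb") as handle:
        data = handle.read()
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,                    # Windows style
        "LF": data.count(b"\n") - crlf,  # Unix style
        "CR": data.count(b"\r") - crlf,  # lone carriage returns
    }
```

A file that reports only LF would match TERMSTR=LF, and one that reports only CRLF would match TERMSTR=CRLF; a mix of styles suggests converting the file first.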