Managing a large SPSS (*.sav) file (4.2 GB)

Managing a large SPSS (*.sav) file (4.2 GB) - csv

I have received an SPSS file from survey fielded by another company that allegedly only contains ~1500 respondents, but the file size somehow has ballooned 4.2GB. My hunch is that the reason for this is that the file was from a global survey and the 1500 records that have been selected are from the US only so there are a series of blank variables, metadata for those variables that are included in this file and may also be in multiple languages/alphabets.
I only need a subset of this data, and can likely work with it if I removed the metadata but my issue has been that I can't get the damn thing open to cut down on the number of variables. I have been using the tools at my disposal to try the following workarounds, though I'm sure there are better options:
Opening the file using PSPP (freeware SPSS) - this causes the PSPP to stop responding
Using the R command read.spss (from the foreign package) to write a .csv - this claims that the file has a duplicate variable name and won't proceed further
Using the R command spss.system.file to write a .csv - when I tried this, R has spend a lot of time thinking as it as it attempts to run this and has been running for a couple hours with no apparent success.
Using the PSPP text conversion tool (https://pspp.benpfaff.org/) to create either a dictionary or a .csv file - both of these options crash after the file has completed uploading.
I've gone back to the other company to try have them work on reducing the file size, however I wasn't sure if anyone else had any ideas to do either of the following:
Open the file using another program/converter that could turn it into a .csv or other similarly skinny file format
Use another program to at least read only the variable names included in the file so that I can provide the other company with the specific variables I need

The following command from PSPP should do what you need:
$ pspp-convert originalFile.sav output.csv
In case it doesn't, please provide terminal error message.

Related

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. Setup is a third-party tool copies one JSON file per object to a BLOB storage container. These feed to a Mapping Data Flow. The individual files written by the third party tool work great. If I copy these files to a different BLOB folder using an Azure Copy Data activity, the MDF can no longer parse the files and gives an error: "JSON parsing error, unsupported encoding or multiline." I started this with a Merge Files, but outcome is same regardless of copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy Activity Merge File from JSON to JSON definitely adds an EOL character to each single JSON object as it gets imported to the Merge file. I have also found that the MDF fails definitely with those EOL characters in the Merge file. If I remove all EOL characters from the Merge file, the same MDF will work. For me, this is a bug. The copy activity is adding a character that breaks the MDF. There seems to be a second issue in some of my data that doesn't fail as an individual file but does when concatenated that breaks the MDF when I try to pull all the files together, but I have tested the basic behavior on 1-5000 files and been able to repeat the fail/success tests.
I took the original file, and the copied file, ran them through all of sorts of test, what I eventually found when I dump into Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF masked something else. I'm not sure what to do at this point, but its super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work -- data preview returns same success or failure as pipeline when run
In the case of multiple documents merged by ADF I get the below instead. It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF masked something else. I'm not sure what to do at this point, but its super frustrating.
Repro: Create any valid JSON as a single file, put it in blob storage, use it as a source in a mapping data flow, to do any sink operation. Create a second file with same schema, get them both to run in same flow using wildcard paths. Use a Copy Activity with Merge Files as the Sink Copy Activity and Array of Objects as the File pattern. Try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from standard VS Code JSON extension, and VS 2019 "Unminify" command) and reupload... It should work now.

don't know if you already solved the problem: I came across the exact same problem 3 days ago and after several tries I found a solution:
in the copy data activity under sink settings, use "set of objects" (instead of "array of objects") under File Pattern, so that the merged big JSON has the value of the original small JSON files written per line
in the MDF after setting up the wildcard paths with the *.json pattern, under JSON Settings select: Document per line as the Document form.
After that you should be good to go, as least it solved my problem. The automatic written CRLF in "array of objects" setting in the copy data activity should be a default setting and MSFT should provide the option to omit it in the settings in the future.

According to my test:
1.copy data activity can't change unix(LF) to windows(CRLF).
2.MDF can also parse unix(LF) file and windows(CRLF) file.
Maybe there is something else wrong.
By the way,I see there is a comma after "name":"Customer Name" in your Original file,I delete it before my test.

Edit a large JSON file

How can I edit a large JSON manually?
I have a large JSON file, about 100 MB. I'd like to manually inspect some attributes, and then add more attributes to some of the objects.
I'd start off by looking at a subset of the file. Say, the 1st 100 objects. I'd gradually scale up to looking then at maybe 250, then a thousand, etc.
Can someone suggest a language or software (I'm running Windows) that excels at this task?
Some previous suggestion that aren't working or can't work.
Sublime - Could never load the file. Loading bar forever. Had to kill.
NotePad++ - Could never load. Froze. Had to kill.
Anything online - The data is confidential.
More Python and Jupyter information.
with open(path, 'r') as f:
data = json.load(f)
for i, (k, v) in enumerate(data.items()):
print(i, k, v)
if i == 2:
break
Causes an error. I think it has to do with Jupyter, but I'm not sure.
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
That makes me wonder if going about it this way is just dumb.
Possible Solutions
Build a custom app using TKinter
Just don't use a Jupyter Notebook

What you can do is to write a simple GUI program. use TKinter, to create a window and a text area inside it to show the json, a text box where you will input, how many objects you want to see, and a button named Next or something to see next and one more button to save.The following will be the functionalities for each of the items.
First you will be reading complete json in python and making it a dict.
Next Button - This will keep iterating based on the value in the TextBox. you could write a custom generator, where it will be yielding based on the number of values required.
Save Button-: This will keep saving the current json into a new json or if you could, you can try to write a function to update the current json directly.
Text Area - you should take the dictionary and convert to json and show the output from the Next Button's generator.

If you are using linux (or have an opportunity to transfer the file to *nix) you might wish to check out for number of lines within a file via
wc -l myfile.json
Let's say, for the purpose of simplicity, that your file has 2530000 lines and you wish to split it into 100k lines each, you can utilize any of the commands available at your distro to split the file further into desired chunks and then to edit them, one by one.
If you are comfortable with going the "linux way", check out some of the hints given on other topics, i.e.
edit multi-GB file when vi editor doesn't work
I hope it helps!

The only viewer I have used that works on large files (I had up to 250MB size files) is Dadroit. It is fast to view and comes with search.
Now, to edit, I use vi. I search for the location and make local edits. Vim or another simpler editor should work on Windows. Have you tried vscode? 100MB shouldn't be too large for it.
The other awesome terminal tool for viewing and editing data is Visidata. I have had mixed luck with it working on json files.

Not the best answer, but the problem with reading the JSON seems limited to Jupyter Notebooks (or even the limitations of my laptop).
Working in Spyder or running from the command line circumvents the Jupyter error mentioned in the original question.
It'd be great if someone knew how to tweak Jupyter to avoid this problem (sorry, I'm not sure how yet).

for editor,try notepad++
for language, try Python
since you haven't give your data structure, I can't give more answer.

MySQL Workbench 6.1 - Error importing recordset

I'm going to be getting a new computer soon and I don't want to lose all of the data I have entered in my tables, so I decided to test out the feature that allows you to export and import CSV files. I exported a table successfully (data was transferred to Microsoft Excel in CSV file), but when I opened the file in Microsoft Excel and added a few rows and tried to import it back in to MySQL Workbench, I got the following error:
"Error importing recordset
error calling Python module function
SQLIDEUtils.importRecordsetDataFromFile"
I've searched all over for info on this, but can't find any solutions. Does anyone know what I'm doing wrong?

In Workbench, open a MySQL connection and then navigate to [Server] --> [Data Export]. There are several backup options here, including saving the data as an individual file or folder. Choose the databases you want to export, and then click [Start Export].
If you ever prefer using Excel for editing and such, then use the MySQL for Excel plugin to access MySQL databases from within Excel. However, I don't think you need it here.

To export your mySQL data, use mysqldump, which will create all the schema for you.
Excel probably added some stuff to your file and now mySQL can't understand it. The best way to find out is by comparing the files before and after the change.

That error indicates a format problem. If the file is small enough, try opening it in wordpad (or the mac equivalent) and see if there's any difference in the formatting? Could be that the delimiting got a little messed up (this can happen especially with end of row markers in MySQL, I've noticed, it can also happen in mac to pc handoffs). If all else fails you could try exporting using a different format and see if that makes a difference (maybe tsv) when you add new rows.

Another reason can be the line endings used. Depending on the system and editor used to work with the cvs file it the line endings might get changed. For me mysql supported UNIX line endings. And in the editor the line ending had been set to MAC OS 9 since I was using a MAC.
Changing it to UNIX line ending worked.

I found that it might be due to a wrong encoding of the input file.
Using Notepad++ for example (or another similar editor) you need to change file encoding to UTF-8.

tiffcp.exe merging a results file with a results file in a loop

I am building a web app that takes several tiff image files and merges them together into one single tiff image file using GNUWin32 tiffcp.exe from command line.
The way I was doing it was to loop through the file list and build a string of file names to merge into one single variable.
strfileList = "c:folder\folder\folder\aased98-def-wsdeff-434fsdsd-dvv.tif c:folder\folder\folder\aased98-def-wsdeff-434fsdsd-axs.tif c:folder\folder\folder\aased98-def-wsdeff-434fsdsd-dxzs.tif"
Then I would just write to the command line:
tiffcp.exe strFileList results.tif
The file names are guids and so the paths are fairly long and I do not have any control to shorten them. So if I have a bunch of these documents (over 20 files or so), the length of the string variable exceeds the limits for windows command line and the merge fails.
Since this process is just merging files, my next thought was instead of writing the file names to a string, just do the merge one file at a time. So the first time the loop runs the following type of code:
tiffcp.exe file1.tif results.tif
The result is a perfect 476k tif file. But the next iteration of the loop needs to merge the second file plus the contents of the first "results" tif file. So I do this:
tiffcp.exe results.tif file2.tiff results.tif
The results each time are a blank 1K tiff file?
All the examples I can find of tiffcp.exe say file1.tif file2.tif results.tif, none use the results file to write back to itself?
Any suggestions on how to do this?

Try the -a switch to tiffcp.exe
I'm doing something similar in Python and inside my file processing loop I'm issuing the command:
tiffcpp.exe -a temp.tif output.tif
works fine.

For an ASP.NET project you may want to try LibTiff.Net (free, open source, BSD license). That port of libtiff library contains tiffcp utility with source code. You may try to use it in your code.
Disclaimer: I am one of the maintainers of the library.

I believe your problem is caused by the use of results.tif as both input as output. If you increment the file name (i.e. results1.tif to results2.tif etc.) I believe it should work.
This is a rather inefficient approach (tiff1 is copied 9 times if you have 10 files). Since you refer to libtiff, you may take a look at the source of libtiff cp and check if it is worthwhile to embed it.

How can I add file locations to a database after they are uploaded using a Perl CGI script?

I have a CGI program I have written using Perl. One of its functions is to upload pics to the server.
All of it is working well, including adding all kinds of info to a MySQL db. My question is: How can I get the uploaded pic files location and names added to the db?
I would rather that instead of changing the script to actually upload the pics to the db. I have heard horror stories of uploading binary files to databases.
Since I am new to all of this, I am at a loss. Have tried doing some research and web searches for 3 weeks now with no luck. Any suggestions or answers would be greatly appreciated. I would really hate to have to manually add all the locations/names to the db.
I am using: a Perl CGI script, MySQL db, Linux server and the files are being uploaded to the server. I AM NOT looking to add the actual files to the db. Just their location(s).

It sounds like you have your method complete where you take the upload, make it a string and toss it unto mysql similar to reading file in as a string. However since your given a filehandle versus a filename to read by CGI. You are wondering where that file actually is.
If your using CGI.pm, the upload, uploadInfo, the param for the upload, and upload private files will help you deal with the upload file sources. Where they are stashed after the remote client and the CGI are done isn't permanent usually and a minimum is volatile.

You've got a bunch of uploaded files that need to be added to the db? Should be trivial to dash off a one-off script to loop through all the files and insert the details into the DB. If they're all in one spot, then a simple opendir()/readdir() type loop would catch them all, otherwise you can make a list of file paths to loop over and loop over that.
If you've talking about recording new uploads in the server, then it would be something along these lines:
user uploads file to server
script extracts any wanted/needed info from the file (name, size, mime-type, checksums, etc...)
start database transaction
insert file info into database
retrieve ID of new record
move uploaded file to final resting place, using the ID as its filename
if everything goes file, commit the transaction
Using the ID as the filename solves the worries of filename collisions and new uploads overwriting previous ones. And if you store the uploads somewhere outside of the site's webroot, then the only access to the files will be via your scripts, providing you with complete control over downloads.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008