I have multiple CSV files that I need to split into 67 separate files each. Each file has over a million rows and dozens of columns. One of the columns is called "Code" and it ranges from 1 to 67, which is what I have to base the split on. I have been doing the split manually by selecting all of the rows for each value (1, 2, 3, etc.), pasting them into their own CSV file, and saving them, but this is taking way too long. I usually use ArcGIS to create some kind of batch file split, but I am not having much luck this go around. Any tips or tricks would be greatly appreciated!
If you have access to awk, there's a good way to do this.
Assuming your file looks like this:
Code,a,b,c
1,x,x,x
2,x,x,x
3,x,x,x
You want a command like this:
awk -F, 'NR > 1 {print $0 >> ("code" $1 ".csv")}' data.csv
That will write each row to a file named after its code (code1.csv, code2.csv, and so on), skipping the header line. Note that >> appends, so delete any old code*.csv output files before re-running it.
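If awk isn't available, or you want each output file to keep the header row, here's a rough Ruby sketch of the same idea (data.csv and the Code column name are taken from the question; the output naming is a placeholder):

require 'csv'

writers = {}
CSV.foreach('data.csv', headers: true) do |row|
  code = row['Code']
  unless writers[code]
    writers[code] = CSV.open("code#{code}.csv", 'w')
    writers[code] << row.headers   # repeat the header row in each output file
  end
  writers[code] << row.fields
end
writers.each_value(&:close)

With only 67 distinct codes it's fine to keep all the output files open at once.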
I have several huge (>2GB) JSON files that end in ,\n]. Here is my test file example, which is the last 25 characters of a 2 GB JSON file:
test.json
":{"value":false}}}}}},
]
I need to delete the trailing , and \n and keep the ], i.e. turn the last three characters of the file (,\n]) into just ]. The entire file is on three lines: the opening and closing brackets are each on their own line, and all the contents of the JSON array are on the second line.
I can't load the entire stream into memory to do something like:
string[0..-2]
because the file is way too large. I tried several approaches, including Ruby's:
chomp!(",\n]")
and UNIX's:
sed
both of which made no change to my JSON file. I viewed the last 25 characters by doing:
tail -c 25 filename.json
and also did:
ls -l
to verify that the byte size of the new and the old file versions were the same.
Can anyone help me understand why none of these approaches is working?
It's not necessary to read in the whole file if you're looking to make a surgical operation like this. Instead you can just overwrite the last few bytes in the file:
file = 'huge.json'
# the file ends with the three bytes ",\n]", so overwrite those three bytes in place with "\n]\n"
IO.write(file, "\n]\n", File.stat(file).size - 3)
The key here is to write exactly as many bytes as you back-track from the end; otherwise you'll need to trim the file length, which you can also do with truncate if necessary.
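If you'd rather end up with a file that is actually one byte shorter instead of padding it, here is a sketch of the truncate route, assuming the file ends with exactly ",\n]" as in the test sample:

size = File.stat('huge.json').size
File.open('huge.json', 'r+') do |f|
  f.seek(size - 3)       # position of the trailing comma
  f.write("\n]")         # overwrite ",\n" with "\n]"
  f.truncate(size - 1)   # drop the now-duplicated final byte
end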
Is there an efficient command-line tool for prepending lines to a file inside a ZIP archive?
I have several large ZIP files containing CSV files that are missing their header, and I need to insert the header line. It's easy enough to write a script to extract them, prepend the header, and then re-compress, but the files are so large that it takes about 15 minutes to extract each one. Is there a tool that can edit the ZIP in place without extracting?
Short answer: no.
A ZIP file contains 1 to N file entries, and each of them works as an unsplittable unit, meaning that if you want to do something to an entry, you need to process that entry completely (i.e. extract it).
The only fast operation you can do is add a new file to the archive. That creates a new entry and appends it to the end of the file, but this is probably not what you need.
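For completeness, if appending the header as a separate entry would ever be useful, that part is cheap. A sketch using the rubyzip gem (an assumed dependency; archive.zip and header.csv are placeholders):

require 'zip'   # rubyzip gem

Zip::File.open('archive.zip') do |zip|
  zip.add('header.csv', 'header.csv')   # add header.csv as a new entry alongside the existing ones
end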
I have a file that is split into sections. Each section starts with a header line, e.g.:
Company,CountryA,-,-,-,-
followed by 1 to 20 lines of data in the following format:
,,date,CompanyA,valueA1,valueA2
,,date,CompanyB,valueB1,valueB2
,,date,CompanyC,valueC1,valueC2
,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-
then more data
,,date,CompanyE,valueE1,valueE2
,,date,CompanyF,valueF1,valueF2
,,date,CompanyG,valueG1,valueG2
What I need to be able to do is convert this into a file with the following format.
Company,CountryA,-,-,-,-,,date,CompanyA,valueA1,valueA2
Company,CountryA,-,-,-,-,,date,CompanyB,valueB1,valueB2
Company,CountryA,-,-,-,-,,date,CompanyC,valueC1,valueC2
Company,CountryA,-,-,-,-,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-,,date,CompanyE,valueE1,valueE2
Company,CountryB,-,-,-,-,,date,CompanyF,valueF1,valueF2
Company,CountryB,-,-,-,-,,date,CompanyG,valueG1,valueG2
i.e. I need a script that will go through the file line by line and, whenever it finds a line starting with Company, save it and add it to the beginning of each subsequent line that begins with ,, until it reaches the next line beginning with Company.
I am sure there is probably a simple way of doing this but it is beyond my simple scripting abilities.
With sed:
sed -n '/^Company/h;{/^,,/{G;s/\(.*\)\n\(.*\)/\2\1/p}}' file
Lines starting with Company are copied to the hold space, and each subsequent line starting with ,, has the held header line prepended to it.
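If the sed hold-space trick is hard to follow or extend, the same logic works as a small Ruby sketch (the input filename file is taken from the command above):

header = nil
File.foreach('file') do |line|
  if line.start_with?('Company,')
    header = line.chomp          # remember the current section header
  elsif line.start_with?(',,') && header
    puts header + line           # prepend it to each data line in the section
  end
end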
Say I want to delete lines 5000 - 9000 in a large data file. Can I delete a range of lines easily?
Check out the LineJumper plugin. It's marked in Package Control as ST3-only, but it should work fine with ST2, though you'll have to git clone the repository into your Packages directory. Once you've done that, open Packages/LineJumper/LineJumper.sublime-settings and edit the "number_of_lines" argument to the number of lines you want to select (in this case, 4000). Save the file, then hit Ctrl+G and type in 5000 to jump to line 5000. Next, hit Alt+Shift+↓ to jump down 4000 lines, selecting them all. Finally, hit Delete and you're all set. The plugin could probably be modified to open a popup asking for the lines to be selected, if you don't want to edit the .sublime-settings file every time you want to select a large block of text, but I'll leave that as an exercise for the reader :)
The manual answer with no plugins:
Cut from the first location (e.g. line 5000) to the end of the file.
Paste this into another/new window.
Find the second location (e.g. line 4000 (= 9000 - 5000)).
Cut from here back to the start of the file (this is the block you wanted to delete, so it can be discarded).
Cut everything that is left and paste it back into the first file at the end.
This is easier than scrolling through the whole block you want to remove, and the effort does not depend on the size of that block (but it's still not fully satisfying...).
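If the file is too large to handle comfortably in the editor at all, the same range delete can also be done with a small script outside Sublime; a Ruby sketch, with the filenames as placeholders:

File.open('data_trimmed.txt', 'w') do |out|
  File.foreach('data.txt').with_index(1) do |line, i|
    out.write(line) unless (5000..9000).cover?(i)   # skip lines 5000-9000 inclusive
  end
end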