Remove last few characters from big JSON file

I have several huge (>2GB) JSON files that end in ,\n]. Here is my test file example, which is the last 25 characters of a 2 GB JSON file:
test.json
":{"value":false}}}}}},
]
From the last three characters of the file (,\n]), I need to delete the ,\n and add the ] back in, so the array closes without the trailing comma. The entire file is on three lines: the opening and closing brackets are each on their own line, and all the contents of the JSON array are on the second line.
I can't load the entire stream into memory to do something like:
string[0..-2]
because the file is way too large. I tried several approaches, including Ruby's:
chomp!(",\n]")
and UNIX's:
sed
both of which made no change to my JSON file. I viewed the last 25 characters by doing:
tail -c 25 filename.json
and also did:
ls -l
to verify that the byte size of the new and old file versions was the same.
Can anyone help me understand why none of these approaches is working?

It's not necessary to read in the whole file if you're looking to make a surgical operation like this. Instead you can just overwrite the last few bytes in the file:
file = 'huge.json'
IO.write(file, "\n]\n", File.stat(file).size - 3)
The key here is to write out as many bytes as you back-track from the end (here the last three bytes, ",\n]", are overwritten with "\n]\n"); otherwise you'll need to trim the file length, though you can do that as well if necessary with truncate.
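The same trick works from the shell as a two-step overwrite (a sketch using GNU coreutils; check the actual trailing bytes with tail -c 5 first, since the byte count below assumes the file ends with ",\n]" and no trailing newline):
truncate -s -3 huge.json      # drop the trailing ",\n]"
printf '\n]\n' >> huge.json   # close the array again, without the comma
Both steps only touch the end of the file, so they stay cheap even on a multi-gigabyte file.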

Related

CSV - Deleting lines containing numbers with wrong structure

Good afternoon all,
I was saving data from an oscilloscope to a USB stick as point-delimited, comma-separated files, and apparently there was some problem with the transfer, resulting in lines that do not match the usual numerical format. It is hard to explain, easier to show:
1.788400e-04,0.008,0.006,0.008
1.788600e-04,-0.008,0.002,0.02
1.788800e-04,0.016,0.002,0
1.789200e-04,0,0.002.673200e-04,0.008,0.012,0.12
1.673400e-04,0,-0.002,0.008
1.673600e-04,0,0.01,0.012
1.673800e-04,0.008,0.002,0.008
What I mean is the 0.002.673200e-04 on the 4th row. Luckily it is not too frequent, and lines such as this can be deleted. It is however hard to find, as the files are around a million lines. First I thought it would be easy to do by locating the .002. and deleting the line using:
grep -v ".002." testfile.csv > testfile-fixed.csv
This indeed worked; however, the number between the dots changes. So far I have found .000. and .002., and it may not be limited to those two.
The other thing that changes is the number of columns.
Is there some easy way to get rid of these lines?
thank you
If it is OK to delete all the lines containing a number with two dots, I suggest you use sed instead of grep.
sed '/\.[0-9]*\./d' testfile.csv > testfile-fixed.csv
This command deletes the lines matching the regex \.[0-9]*\., which matches any line containing a dot, followed by zero or more digits, followed by another dot.
You can even make the change inside the file itself, but if you make a mistake you can destroy your file, so make a backup first. Use sed's -i flag:
sed -i '/\.[0-9]*\./d' testfile.csv
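Since the glitch can also change the number of columns, another option (a sketch, assuming well-formed lines always have exactly four comma-separated fields) is to keep only the lines with the expected field count:
awk -F, 'NF == 4' testfile.csv > testfile-fixed.csv
awk splits each line on commas and prints it only when it has exactly four fields, so the merged lines from the bad transfer are dropped regardless of which digits appear between the dots.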

split big json files into small pieces without breaking the format

I'm using spark.read() to read a big JSON file on Databricks, and it failed with "spark driver has stopped unexpectedly and is restarting" after a long time of running. I assumed it is because the file is too big, so I decided to split it. So I used the command:
split -b 100m -a 1 test.json
This did split my file into small pieces, and I can now read them on Databricks. But then I found that what I got is a set of null values. I think that is because I split the file only by size, so some of the pieces are no longer in JSON format. For example, I might get something like this at the end of a piece:
{"id":aefae3,......
Then it can't be read by spark.read.format("json"). So is there any way I can separate the JSON file into small pieces without breaking the JSON format?
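One common approach (a sketch, assuming the top level of the file is a single JSON array and that jq is available) is to convert the array to line-delimited JSON first and then split on line boundaries, so every piece contains only complete records:
jq -c '.[]' test.json | split -l 100000 - part_
jq -c '.[]' prints each array element compactly on its own line, and split -l cuts only between lines, so spark.read.format("json") can read the resulting pieces as JSON Lines. Note that plain jq loads the whole document into memory, so this needs a machine with enough RAM for the original file.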

TCL: file normalize gives unexpected output

I have the following line of code:
file normalize [string map {\\ /} $file]
The string map operation is to make the line work for paths containing backslashes instead of forward slashes (as is the case on Windows).
For some values of $file (let's say it's "/path/to/my/file") I get output similar to:
/path/to/"/path/to/my/file/"
This doesn't happen for all paths but I'm unable to figure out what causes it. There are no symbolic links in the path.
Is there something I'm doing wrong, or is there an alternative to file normalize that I could try?
My Tcl version is 8.5.
UPDATE:
On further investigation I see that the string map is not making any difference. The output of file normalize itself is coming with that extra text before the desired text. Also, the extra text seems to be from a previous run of the code.
UPDATE 2: It was because of the quotation marks in the input to file normalize
Most likely the path has literal quotation marks where it shouldn't have them.
% file normalize {"/path/to/some/file"}
/path/to/"/path/to/some/file"
% file normalize \"/path/to/some/file\"
/path/to/"/path/to/some/file"
Perhaps some pathname handling code escaped special characters for some reason and left the path in a mangled state.
I would try to keep the pathname pristine and when it needs to be changed for display or other processing, make a copy of it first.

Prepare a file for inputting to MySQL

I have a file that is split into sections. Each section starts with a header line, e.g.:
Company,CountryA,-,-,-,-
then 1 to 20 lines of data with the following format
,,date,CompanyA,valueA1,valueA2
,,date,CompanyB,valueB1,valueB2
,,date,CompanyC,valueC1,valueC2
,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-
then more data
,,date,CompanyE,valueE1,valueE2
,,date,CompanyF,valueF1,valueF2
,,date,CompanyG,valueG1,valueG2
What I need to be able to do is convert this into a file with the following format.
Company,CountryA,-,-,-,-,,date,CompanyA,valueA1,valueA2
Company,CountryA,-,-,-,-,,date,CompanyB,valueB1,valueB2
Company,CountryA,-,-,-,-,,date,CompanyC,valueC1,valueC2
Company,CountryA,-,-,-,-,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-,,date,CompanyE,valueE1,valueE2
Company,CountryB,-,-,-,-,,date,CompanyF,valueF1,valueF2
Company,CountryB,-,-,-,-,,date,CompanyG,valueG1,valueG2
i.e. I need a script that will go through the file and read each line, and if it finds a line starting with Company it saves it and adds it to the beginning of each subsequent line (if it begins with ,,) until a line begins with Company again.
I am sure there is probably a simple way of doing this but it is beyond my simple scripting abilities.
With sed:
sed -n '/^Company/h;{/^,,/{G;s/\(.*\)\n\(.*\)/\2\1/p}}' file
Lines starting with Company are copied to the hold space, and subsequent lines starting with ,, are prepended with the held line.
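An equivalent awk sketch, if the hold-space syntax is hard to follow:
awk '/^Company/ {hdr = $0; next} /^,,/ {print hdr $0}' file
Each Company line is remembered in hdr, and every following line that starts with ,, is printed with that header glued to the front, which gives the same output as the sed command.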

head and tail on Unix

My second homework assignment asks us to write a command in Unix or Linux to extract certain sections of multiple files using head and tail. I'm not understanding how to go about this. Here is the question:
(5 points) Using head and tail, write a command to extract the second section of a file (i.e. the data section).
Turn this into an executable script called extractdata (you do not need to hand this in). Then, using find and extractdata, write a command to get the second section of all .csv files in the month directories, and place the output into a file called polls.csv. Be sure to keep this file in your homedir. You will use it again on the next assignment. [hint] Inside the script don't forget the command line variable $1. Example:
head -52 $1
The .csv files consist of three parts: (1) a two-line header describing the fields; (2) 51 lines representing data for each state (plus Washington DC); (3) the rest of the file, which is summary information. The data fields for each state in the second part are comma separated.
I have to get the second section.
Thank you.
Take it in stages:
Read what head and tail both do (get the first and last n lines),
think about what you need (the middle 51 lines)
how can you do that?
Use head to extract the first 53 lines. Use tail to extract the last 51 lines of the result (effectively ignoring the first 2 header lines).
The problem I had was figuring out how to get the data from multiple .csv files. I used wildcards to solve my issue.
If anyone else needs to know, I used this:
head -n 53 /usr/local/tmp/election2008/*/*.csv | tail -n 51
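For reference, a sketch of how the pieces were probably meant to fit together (the path and directory layout are taken from the question, so treat them as assumptions). Save the pipeline as the extractdata script:
#!/bin/sh
# extractdata: print section 2 (the 51 state lines) of one .csv file,
# skipping the two-line header
head -n 53 "$1" | tail -n 51
Then make it executable and let find run it once per file, so each file's header is skipped individually rather than only once for the whole concatenation:
find /usr/local/tmp/election2008 -name '*.csv' -exec ./extractdata {} \; > polls.csv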