Prepare a file for inputting to mysql - mysql

I have a file that is split into sections. Each section starts with a header line, e.g.:
Company,CountryA,-,-,-,-
then 1 to 20 lines of data in the following format:
,,date,CompanyA,valueA1,valueA2
,,date,CompanyB,valueB1,valueB2
,,date,CompanyC,valueC1,valueC2
,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-
then more data:
,,date,CompanyE,valueE1,valueE2
,,date,CompanyF,valueF1,valueF2
,,date,CompanyG,valueG1,valueG2
What I need to be able to do is convert this into a file with the following format.
Company,CountryA,-,-,-,-,,date,CompanyA,valueA1,valueA2
Company,CountryA,-,-,-,-,,date,CompanyB,valueB1,valueB2
Company,CountryA,-,-,-,-,,date,CompanyC,valueC1,valueC2
Company,CountryA,-,-,-,-,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-,,date,CompanyE,valueE1,valueE2
Company,CountryB,-,-,-,-,,date,CompanyF,valueF1,valueF2
Company,CountryB,-,-,-,-,,date,CompanyG,valueG1,valueG2
i.e. I need a script that goes through the file line by line; whenever it finds a line starting with Company, it saves that line and prepends it to each subsequent line that begins with ,, until the next line starting with Company.
I am sure there is probably a simple way of doing this but it is beyond my simple scripting abilities.

With sed :
sed -n '/^Company/h;/^,,/{G;s/\(.*\)\n\(.*\)/\2\1/p}' file
Lines starting with Company are copied to the hold space, and each subsequent line starting with ,, is prepended with the held line.
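If you prefer awk, here is an equivalent one-liner (a sketch assuming every header starts with Company and every data line with ,,): save the current header, then print it in front of each data line.
awk '/^Company/ { header = $0; next } /^,,/ { print header $0 }' file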

Related

CSV - Deleting lines containing numbers with wrong structure

Good afternoon all,
I was saving data from an oscilloscope to a USB stick as comma-separated files (with a point as the decimal separator), and apparently there was some problem with the transfer, resulting in lines that do not match the usual numerical format. It is hard to explain, easier to show:
1.788400e-04,0.008,0.006,0.008
1.788600e-04,-0.008,0.002,0.02
1.788800e-04,0.016,0.002,0
1.789200e-04,0,0.002.673200e-04,0.008,0.012,0.12
1.673400e-04,0,-0.002,0.008
1.673600e-04,0,0.01,0.012
1.673800e-04,0.008,0.002,0.008
What I mean is the 0.002.673200e-04 on the 4th row. Luckily it is not too frequent, and lines such as this can simply be deleted. They are hard to find, however, as the files are around a million lines long. First I thought it would be easy to locate the .002. and delete the matching lines using:
grep -v ".002." testfile.csv > testfile-fixed.csv
This indeed worked; however, the number between the dots changes. So far I have found .000. and .002., and it may not be limited to those two.
The other thing that changes is the number of columns.
Is there some easy way to get rid of these lines?
Thank you.
If it is OK to delete all the lines containing a number with two dots, I suggest using sed instead of grep.
sed '/\.[0-9]*\./d' testfile.csv > testfile-fixed.csv
This command deletes the lines matching the regex \.[0-9]*\., which matches any line containing a dot, followed by zero or more digits, followed by another dot.
You can even make the change inside the file itself, but if you make a mistake you can destroy your file, so make a backup first. Use the -i flag with sed:
sed -i '/\.[0-9]*\./d' testfile.csv
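If you would rather stay with grep (as in your original attempt), the same filter can be written with an extended regular expression and inverted matching:
grep -Ev '\.[0-9]*\.' testfile.csv > testfile-fixed.csv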

Remove last few characters from big JSON file

I have several huge (>2GB) JSON files that end in ,\n]. Here is my test file example, which is the last 25 characters of a 2 GB JSON file:
test.json
":{"value":false}}}}}},
]
I need to delete the trailing ,\n while keeping the closing ] from the last three characters. The entire file is on three lines: the opening and closing brackets are each on their own line, and all the contents of the JSON array are on the second line.
I can't load the entire stream into memory to do something like:
string[0..-2]
because the file is way too large. I tried several approaches, including Ruby's chomp!(",\n]") and UNIX's sed, both of which made no change to my JSON file. I viewed the last 25 characters with tail -c 25 filename.json, and ran ls -l to verify that the byte sizes of the new and the old file versions were the same.
Can anyone help me understand why none of these approaches is working?
It's not necessary to read in the whole file if you're looking to make a surgical operation like this. Instead you can just overwrite the last few bytes in the file:
file = 'huge.json'
# Overwrite the final three bytes (",\n]") in place with "\n]\n";
# IO.write with an offset seeks to that position instead of truncating.
IO.write(file, "\n]\n", File.stat(file).size - 3)
The key here is to write exactly as many bytes as you back-track from the end; otherwise you'll need to trim the file length, which you can also do if necessary with truncate.
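If you would rather do it from the shell, here is a minimal sketch assuming GNU coreutils and that the file really does end with the three bytes ,\n] (check first with tail):
tail -c 3 huge.json | od -c    # confirm the ending is , \n ]
truncate -s -3 huge.json       # chop the trailing ",\n]" off in place
printf '\n]\n' >> huge.json    # append the closing bracket again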

How to delete every X line of a very large data file?

I've got a very large .csv file, which contains 10 million lines of data. The file size is around 250 MB. Each line contains three values and looks like this:
-9.8199980e-03,183,-4.32
I want to delete every 2nd line or e.g. copy every 10th line straight into a new file. Which program should I use and can you also post the code?
I tried it with Scilab and Excel; they either couldn't open the file at all or opened only a small part of it. I can open the file in Notepad++, but when I tried to record and run a macro that deletes every 2nd line, it crashed.
I would recommend you install gawk/awk and harness the power of this brilliant tool.
If you want every other line:
gawk "NR%2" original.csv > new.csv
If you want every 10th line:
gawk 'NR%10==0" original.csv > new.csv

How to prepend a line to a file inside a zip file

Is there an efficient command-line tool for prepending lines to a file inside a ZIP archive?
I have several large ZIP files containing CSV files missing their header, and I need to insert the header line. It's easy enough to write a script to extract them, prepend the header, and then re-compress, but the files are so large, it takes about 15 minutes to extract each one. Is there some tool that can edit the ZIP in-place without extracting?
Short answer: no.
A ZIP file contains 1 to N file entries, and each of them works as an unsplittable unit, meaning that if you want to do something to an entry, you need to process that entry completely (i.e. extract it).
The only fast operation you can do is adding a new file to your archive. It will create a new entry and append it to the file, but this is probably not what you need.
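That said, if most of your 15 minutes goes to writing temporary files rather than to compression itself, one workaround (a sketch, not a true in-place edit; assumes Info-ZIP's unzip and zip and a single entry named data.csv, with a hypothetical header line) is to stream the entry through a pipe and prepend the header on the way, so the full CSV never lands on disk:
unzip -p archive.zip data.csv | { printf 'col1,col2,col3\n'; cat; } | zip fixed.zip -
The new archive stores the stream under the entry name -, which can be renamed afterwards with zipnote.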

head and tail on Unix

My second homework assignment asks us to write a command in Unix or Linux to extract certain sections of multiple files using head and tail. I'm not understanding how to go about this. Here is the question:
(5 points) Using head and tail, write a command to extract the second section of a file (i.e. the data section).
Turn this into an executable script called extractdata (you do not need to hand this in). Then, using find and extractdata, write a command to get the second section of all .csv files in the month directories, and place the output into a file called polls.csv. Be sure to keep this file in your homedir. You will use it again on the next assignment. [hint] Inside the script, don't forget the command-line variable $1. Example:
head -52 $1
The .csv files consist of three parts: (1) a two-line header describing the fields; (2) 51 lines representing data for each state (plus Washington DC); (3) the rest of the file, which is summary information. The data fields for each state in the second part are comma-separated.
I have to get the second section.
Thank you.
Take it in stages:
Read what head and tail do (they get the first and last n lines of a file, respectively),
think about what you need (the middle 51 lines),
and then work out how to combine them.
Use head to extract the first 53 lines. Use tail to extract the last 51 lines of the result (effectively ignoring the first 2 header lines).
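Putting that together, extractdata could look like this (a sketch; the 53/51 split assumes exactly the two header lines described in the assignment):
#!/bin/sh
# Print the 51 state lines (section 2) of the .csv file named in $1:
# take the first 2 + 51 = 53 lines, then keep the last 51 of those.
head -n 53 "$1" | tail -n 51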
The problem I had was figuring out how to get the data from multiple .csv files. I used wildcards to solve my issue.
If anyone else needs to know, this is what I used (the stray $1 arguments in my first attempt made head and tail read the wrong input, so the pipeline has to run once per file):
for f in /usr/local/tmp/election2008/*/*.csv; do head -n 53 "$f" | tail -n 51; done > ~/polls.csv
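Since the assignment specifically asks for find together with extractdata, an equivalent command would be (a sketch; assumes the script is executable and sits in the current directory):
find /usr/local/tmp/election2008 -name '*.csv' -exec ./extractdata {} \; > ~/polls.csv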