head and tail on Unix - csv

My second homework assignment asks for us to write a command in Unix or Linux to extract certain sections of multiple files using head and tail. I'm not understanding how to go about this. Here is the question:
(5 points) Using head and tail, write a command to extract the second section of a file (i.e. the data section).
Turn this into an executable script called extractdata (you do not need to hand this in). Then use find and extractdata, write a command to get the second section of all .csv files in the month directories, and place the output into a file called polls.csv. Be sure to keep this file in your homedir. You will use it again on the next assignment. [hint] Inside the script don't forget the command line variable $1. example:
head -52 $1
The .csv files consist of three parts: (1) a two line header, describing the fields; (2) 51 lines representing data for each state (plus Washington DC); (3) the rest of the file is summary information. The data fields for each state in the second part is comma separated.
I have to get the second section.
Thank you.

Take it in stages:
Read what head and tail both do, (get the first and last n lines)
think about what you need (the middle 51 lines)
how can you do that?

Use head to extract the first 53 lines. Use tail to extract the last 51 lines of the result (effectively ignoring the first 2 header lines).

The problem I had was figuring out how to get the data from multiple .csv files. I used wild cards to solve my issue.
If anyone else needed to know I used this:
head -n 53 $1 /usr/local/tmp/election2008/*/*.csv | tail -n 51 $1

Related

CSV - Deleting lines containing numbers with wrong structure

Good afternoon all,
I was saving data from oscilloscope to USB stick as point delimited coma separated files and apparently there was some problem with transfer resulting in appearance of lines that do not match "usual" numerical format. It is hard to explain, easier to show:
1.788400e-04,0.008,0.006,0.008
1.788600e-04,-0.008,0.002,0.02
1.788800e-04,0.016,0.002,0
1.789200e-04,0,0.002.673200e-04,0.008,0.012,0.12
1.673400e-04,0,-0.002,0.008
1.673600e-04,0,0.01,0.012
1.673800e-04,0.008,0.002,0.008
What I mean is the 0.002.673200e-04 on 4th row. Luckily it is not too frequent and lines such as this can be deleted. It is however hard to find as the files are around million lines. First I thought it would be easy to do by locating the .002. and deleting it using:
grep -v ".002." testfile.csv > testfile-fixed.csv
This indeed worked, however the number between the dots changes. So far I managed to find .000. and .002. and it may not be limited to those two.
The other thing that changes is the number of columns.
Is there some easy way to get rid of these lines?
thank you
If it is OK to delete all the lines containing a number with two dots, I suggest you to use sed instead of grep.
sed '/\.[0-9]*\./d' testfile.csv > testfile-fixed.csv
This command deletes the line matching the regex \.[0-9]*\., which matches all the lines containing a dot followed by 0 or more digits and followed by a dot.
You can even do the change inside the file itself, but if you make a mistake, you can destroy your file, so make first a backup. Use the flag -i with sed:
sed -i '/\.[0-9]*\./d' testfile.csv

Splitting a CSV File by value of specific column

I have multiple CSV files that I need to split into 67 separate files each. Each sheet has over a million rows and dozens of columns. One of the columns is called "Code" and it ranges from 1 to 67 which is what I have to base the split on. I have been doing this split manually by selecting all of the rows within each value (1, 2, 3, etc) and pasting them into their own CSV file and saving them, but this is taking way too long. I usually use ArcGIS to create some kind of batch file split, but I am not having much luck in doing so this go around. Any tips or tricks would be greatly appreciated!
If you have access to awk there's a good way to do this.
Assuming your file looks like this:
Code,a,b,c
1,x,x,x
2,x,x,x
3,x,x,x
You want a command like this:
awk -F, 'NR > 1 {print $0 >> "code" $1 ".csv"}' data.csv
That will save it to files like code1.csv etc., skipping the header line.

What does 'multiline strings are different' meant by from RIDE (Robot Framework) output?

i am trying to compare two csv file data and followed below process in RIDE -
${csvA} = Get File ${filePathA}
${csvB} = Get File ${filePathB}
Should Be Equal As Strings ${csvA} ${csvB}
Here are my two csv contents -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As few of the data is not in match, when i Run the code in RIDE, the result is FAIL. But in the log below data is shown -
**
Multiline strings are different:
--- first
+++ second
## -1,4 +1,4 ##
Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa**
I would like to know the meaning of ---first +++second ##-1,4+1,4## content.
Thanks in advance!
When robot compares multiline strings (data that has newlines in it), it uses the standard unix tool diff to show the differences. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it's treating the data as two files and showing the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "## -1 +1 ##" mean in Git's diff output?. (stackoverflow)
the diff man page (gnu.org)
In short, the ## gives you a reference for which line numbers are different, and the + and - show you which lines are different.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the line beginning with acc. Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.

Remove last few characters from big JSON file

I have several huge (>2GB) JSON files that end in ,\n]. Here is my test file example, which is the last 25 characters of a 2 GB JSON file:
test.json
":{"value":false}}}}}},
]
I need to delete the ,\n and add back in the ] from the last three characters of the last line. The entire file is on three lines: both the front and end brackets are on their own line, and all the contents of the JSON array is on the second line.
I can't load the entire stream into memory to do something like:
string[0..-2]
because the file is way too large. I tried several approaches, including Ruby's:
chomp!(",\n]")
and UNIX's:
sed
both of which made no change to my JSON file. I viewed the last 25 characters by doing:
tail -c 25 filename.json
and also did:
ls -l
to verify that the byte size of the new and the old file versions were the same.
Can anyone help me understand why none of these approaches is working?
It's not necessary to read in the whole file if you're looking to make a surgical operation like this. Instead you can just overwrite the last few bytes in the file:
file = 'huge.json'
IO.write(file, "\n]\n", File.stat(file).size - 5)
The key here is to write as many bytes out as you back-track from the end, otherwise you'll need to trim the file length, though you can do that as well if necessary with truncate.

Prepare a file for inputing to mysql

I have a file that is split into sections. Each section starts with a headerline e.g.:
Company,CountryA,-,-,-,-
then 1 to 20 lines of data with the following format
,,date,CompanyA,valueA1,valueA2
,,date,CompanyB,valueB1,valueB2
,,date,CompanyC,valueC1,valueC2
,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-
then more data
,,date,CompanyE,valueE1,valueE2
,,date,CompanyF,valueF1,valueF2
,,date,CompanyG,valueG1,valueG2
What I need to be able to do is convert this into a file with the following format.
Company,CountryA,-,-,-,-,,date,CompanyA,valueA1,valueA2
Company,CountryA,-,-,-,-,,date,CompanyB,valueB1,valueB2
Company,CountryA,-,-,-,-,,date,CompanyC,valueC1,valueC2
Company,CountryB,-,-,-,-,,date,CompanyD,valueD1,valueD2
Company,CountryB,-,-,-,-,,date,CompanyE,valueE1,valueE2
Company,CountryB,-,-,-,-,,date,CompanyF,valueF1,valueF2
Company,CountryB,-,-,-,-,,date,CompanyG,valueG1,valueG2
i.e. I need a script that will go through the file and read each line and if it finds a line starting company it saves it add adds it to the beginning of the subsequent line (if it begins ,,) until the line begins with Company again.
I am sure there is probably a simple way of doing this but it is beyond my simple scripting abilities.
With sed :
sed -n '/^Company/h;{/^,,/{G;s/\(.*\)\n\(.*\)/\2\1/p}}' file
Lines starting with Company are added to the hold space and subsequent lines starting with ,, are prepend with the hold line.