I wrote the following code:
BEGIN{FS=OFS=","}
NR==FNR &&
$7{sum+=$7;
elementos++;
next}
!$7{$7=media}
{print}
ENDFILE{media=sum/elementos}
This awk script adds the average age to the empty cells on column 'age'.
Execution of the code is done as follows:
awk -f c_awk.awk train3.csv
Now I am trying to save the changes done in a new CSV file using awk. (new file: train4.csv)
I have been trying with
> ./c_awk.awk/train4.csv in the last line but it doesn't work.
awk: c_awk.awk:12: ENDFILE{media=sum/elementos}> /tmp/train4.csv
awk: c_awk.awk:12: ^ syntax error
awk: c_awk.awk:12: ENDFILE{media=sum/elementos}> /tmp/train4.csv
awk: c_awk.awk:12: ^ syntax error
The file from where I am trying to implement the changes looks like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
The expected result is the following:
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
Thanks.
awk -f c_awk.awk train3.csv > /tmp/train4.csv
or just change {print} to {print > "/tmp/train4.csv"}
Your script contains several issues unrelated to writing to a new file that I suspect you aren't aware of - you should post a new question with concise, testable sample input and expected output so we can help you with that as it's not entirely clear what you want that script to do
Related
here is how my dataset looks like, I am trying to filter out country that the 4th column is >= 1000.
Marshall Islands,53127,77,41
Vanuatu,276244,25,70
Solomon Islands,611343,23,142
Sao Tome and Principe,204327,72,147
Belize,374681,46,171
Maldives,436330,39,172
Guyana,777859,27,206
Eswatini,1367254,24,323
Timor-Leste,1296311,30,392
Lesotho,2233339,28,619
Guinea-Bissau,1861283,43,799
Namibia,2533794,49,1242
Gambia,2100568,61,1273
.
.
.
Zimbabwe,16529904,32,5329
(total 77 lines of data)
I have tried to run the following command on my terminal, but it only output 1 line of the dataset to new file.
awk -F, '$4 > 999' original.csv > new.csv
*update, all line except Zimbabwe are ending with ^M$.
Here is desired output
Namibia,2533794,49,1242
Gambia,2100568,61,1273
Burundi,10864245,13,1380
Armenia,2930450,63,1849
Rwanda,12208407,17,2091
Mongolia,3075647,68,2103
Kyrgyzstan,6045117,36,2184
Mauritania,4420184,53,2335
Lao People's Democratic Republic,6858160,34,2357
Liberia,4731906,51,2399
Tajikistan,8921343,27,2407
Sierra Leone,7557212,42,3147
Togo,7797694,41,3210
Chad,14899994,23,3406
Congo,5260750,66,3496
Cambodia,16005373,23,3678
Paraguay,6811297,61,4175
El Salvador,6377853,71,4546
Guinea,12717176,36,4552
Benin,11175692,47,5227
Zimbabwe,16529904,32,5329
Azerbaijan,9827589,55,5439
Burkina Faso,19193383,29,5517
Nepal,29304998,19,5666
Haiti,10981229,54,5968
Somalia,14742523,44,6544
Zambia,17094131,43,7346
Senegal,15850567,47,7409
Bolivia (Plurinational State of),11051600,69,7634
Mali,18541980,42,7708
Tunisia,11532127,69,7916
Guatemala,16913504,51,8572
Dominican Republic,10766998,80,8643
Cuba,11484636,77,8841
Afghanistan,35530082,25,8971
Syrian Arab Republic,18269867,54,9774
Uganda,42862957,23,9942
Yemen,28250420,36,10175
Kazakhstan,18204498,57,10438
Ecuador,16624857,64,10585
Côte d'Ivoire,24294750,50,12227
Kenya,49699863,27,13201
Cameroon,24053727,56,13416
Sudan,40533328,34,13931
Ghana,28833629,55,15976
Myanmar,53370609,30,16183
United Republic of Tanzania,57310020,33,18943
Angola,29784193,65,19312
Ethiopia,104957438,20,21317
Peru,32165484,78,24999
Iraq,38274617,70,26899
Algeria,41318141,72,29771
Viet Nam,95540797,35,33643
Thailand,69037516,49,33966
Democratic Republic of the Congo,81339984,44,35692
South Africa,56717156,66,37348
Colombia,49065613,80,39471
Egypt,97553148,43,41660
Philippines,104918094,47,48978
Bangladesh,164669750,36,59047
Pakistan,197015953,36,71797
Nigeria,190886313,50,94525
Mexico,129163273,80,103159
Indonesia,263991375,55,144295
India,1339180125,34,449965
Does anyone have suggestions on how to fix this issue?
Assuming that your Input_file's last field may have spaces in it. You can also check it by doing cat -e Input_file it will show you where is line ending including hidden spaces at the line end. If this is the case then try following command.
awk 'BEGIN{FS=","} $4+0 > 999' Input_file
sorry if this is a newbie question, but I didn't find the answer to this particular question on stackoverflow.
I have a (very large) fixed-width data file that looks like this:
simplefile.txt
ratno fdate ratname typecode country
12346 31/12/2010 HARTZ 4 UNITED STATES
12444 31/12/2010 CHRISTIE 5 UNITED STATES
12527 31/12/2010 HILL AIR 4 UNITED STATES
15000 31/12/2010 TOKUGAVA INC. 5 JAPAN
37700 31/12/2010 HARTLAND 1 UNITED KINGDOM
37700 31/12/2010 WILDER 1 UNITED STATES
18935 31/12/2010 FLOWERS FINAL SERVICES INC 5 UNITED STATES
37700 31/12/2010 MAPLE CORPORATION 1 CANADA
48614 31/12/2010 SERIAL MGMT L.P. 5 UNITED STATES
1373 31/12/2010 AMORE MGMT GROUP N A 1 UNITED STATES
I am trying to convert it into a csv file using the terminal (the file is too big for Excel) that would look like this:
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT L.P.,5,UNITED STATES
1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES
I dug a bit around on this site and found a possible solution that relies on the awk shell command:
awk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' "simpletestfile.txt"
However, when I execute the above command in the terminal, it inadvertently also inserts commas in all white spaces, inside the separate words of what is supposed to remain a single field. The result of the above execution is as follows:
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED,STATES
12444,31/12/2010,CHRISTIE,5,UNITED,STATES
12527,31/12/2010,HILL,AIR,4,UNITED,STATES
15000,31/12/2010,TOKUGAVA,INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED,KINGDOM
37700,31/12/2010,WILDER,1,UNITED,STATES
18935,31/12/2010,FLOWERS,FINAL,SERVICES,INC,5,UNITED,STATES
37700,31/12/2010,MAPLE,CORPORATION,1,CANADA
48614,31/12/2010,SERIAL,MGMT,L.P.,5,UNITED,STATES
1373,31/12/2010,AMORE,MGMT,GROUP,N,A,1,UNITED,STATES
How can I avoid inserting commas in white spaces outside of delineated fieldwidths? Thank you!
Your attempt was good, but requires gawk (gnu awk) for the FIELDWIDTHS built-in variable. With gawk:
$ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' file
ratno, fdate, ratname , typecode, country
12346, 31/12/2010, HARTZ , 4 , UNITED STATES
12444, 31/12/2010, CHRISTIE , 5 , UNITED STATES
12527, 31/12/2010, HILL AIR , 4 , UNITED STATES
Assuming you don't want the extra spaces, you can do instead:
$ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{for (i=1; i<=NF; ++i) gsub(/^ *| *$/, "", $i)}1' file
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
If you don't have gnu awk, you can achieve the same results with:
$ awk -v fieldwidths="5 11 31 9 16" '
BEGIN { OFS=","; split(fieldwidths, widths) }
{
rec = $0
$0 = ""
start = 1;
for (i=1; i<=length(widths); ++i) {
$i = substr(rec, start, widths[i])
gsub(/^ *| *$/, "", $i)
start += widths[i]
}
}1' file
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
perl is handy here:
perl -nE ' # read this bottom to top
say join ",",
map {s/^\s+|\s+$//g; $_} # trim leading/trailing whitespace
/^(.{5}) (.{10}) (.{30}) (.{8}) (.*)/ # extract the fields
' simplefile.txt
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT L.P.,5,UNITED STATES
1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES
Although, for proper CSV, we need to be a bit cautious about fields containing commas or quotes. If I was feeling less secure about the contents of the file, I'd use this map block:
map {s/^\s+|\s+$//g; s/"/""/g; qq("$_")}
which outputs
"ratno","fdate","ratname","typecode","country"
"12346","31/12/2010","HARTZ","4","UNITED STATES"
"12444","31/12/2010","CHRISTIE","5","UNITED STATES"
"12527","31/12/2010","HILL AIR","4","UNITED STATES"
"15000","31/12/2010","TOKUGAVA INC.","5","JAPAN"
"37700","31/12/2010","HARTLAND","1","UNITED KINGDOM"
"37700","31/12/2010","WILDER","1","UNITED STATES"
"18935","31/12/2010","FLOWERS FINAL SERVICES INC","5","UNITED STATES"
"37700","31/12/2010","MAPLE CORPORATION","1","CANADA"
"48614","31/12/2010","SERIAL MGMT L.P.","5","UNITED STATES"
"1373","31/12/2010","AMORE MGMT GROUP N A","1","UNITED STATES"
I need a little help, in our class we've been playing around with GREP and SED commands in an attempt to learn how they work. More specifically we've been using sed commands to manipulate text and add tags.
So, we we're given an assignment, we've been given 500 lines of CSV fake data and it is our job to create a sed command that will automatically tag the data and tag any new data added down the road (theoretically).
Here's a few lines of our fake UN-TAGGED data, this is by default how we received it, as you can see all the data starts with a first name and ends with a web email:
FirstName,LastName,Company,Address,City,County,State,ZIP,Phone,Fax,Email,Web
"Essie","Vaill","Litronic Industries","14225 Hancock Dr","Anchorage","Anchorage","AK","99515","907-345-0962","907-345-1215","essie#vaill.com","http://www.essievaill.com"
"Cruz","Roudabush","Meridian Products","2202 S Central Ave","Phoenix","Maricopa","AZ","85004","602-252-4827","602-252-4009","cruz#roudabush.com","http://www.cruzroudabush.com"
"Billie","Tinnes","D & M Plywood Inc","28 W 27th St","New York","New York","NY","10001","212-889-5775","212-889-5764","billie#tinnes.com","http://www.billietinnes.com"
"Zackary","Mockus","Metropolitan Elevator Co","286 State St","Perth Amboy","Middlesex","NJ","08861","732-442-0638","732-442-5218","zackary#mockus.com","http://www.zackarymockus.com"
"Rosemarie","Fifield","Technology Services","3131 N Nimitz Hwy #-105","Honolulu","Honolulu","HI","96819","808-836-8966","808-836-6008","rosemarie#fifield.com","http://www.rosemariefifield.com"
"Bernard","Laboy","Century 21 Keewaydin Prop","22661 S Frontage Rd","Channahon","Will","IL","60410","815-467-0487","815-467-1244","bernard#laboy.com","http://www.bernardlaboy.com"
"Sue","Haakinson","Kim Peacock Beringhause","9617 N Metro Pky W","Phoenix","Maricopa","AZ","85051","602-953-2753","602-953-0355","sue#haakinson.com","http://www.suehaakinson.com"
"Valerie","Pou","Sea Port Record One Stop Inc","7475 Hamilton Blvd","Trexlertown","Lehigh","PA","18087","610-395-8743","610-395-6995","valerie#pou.com","http://www.valeriepou.com"
"Lashawn","Hasty","Kpff Consulting Engineers","815 S Glendora Ave","West Covina","Los Angeles","CA","91790","626-960-6738","626-960-1503","lashawn#hasty.com","http://www.lashawnhasty.com"
"Marianne","Earman","Albers Technologies Corp","6220 S Orange Blossom Trl","Orlando","Orange","FL","32809","407-857-0431","407-857-2506","marianne#earman.com","http://www.marianneearman.com"
"Justina","Dragaj","Uchner, David D Esq","2552 Poplar Ave","Memphis","Shelby","TN","38112","901-327-5336","901-327-2911","justina#dragaj.com","http://www.justinadragaj.com"
"Mandy","Mcdonnell","Southern Vermont Surveys","343 Bush St Se","Salem","Marion","OR","97302","503-371-8219","503-371-1118","mandy#mcdonnell.com","http://www.mandymcdonnell.com"
"Conrad","Lanfear","Kahler, Karen T Esq","49 Roche Way","Youngstown","Mahoning","OH","44512","330-758-0314","330-758-3536","conrad#lanfear.com","http://www.conradlanfear.com"
"Cyril","Behen","National Paper & Envelope Corp","1650 S Harbor Blvd","Anaheim","Orange","CA","92802","714-772-5050","714-772-3859","cyril#behen.com","http://www.cyrilbehen.com"
"Shelley","Groden","Norton, Robert L Esq","110 Broadway St","San Antonio","Bexar","TX","78205","210-229-3017","210-229-9757","shelley#groden.com","http://www.shelleygroden.com"
Our teacher wanted us to create sed commands that would automatically indent the data, add TR to the front and back of the data and add TD tags to each new field.
<HTML>
<HEAD><Title>Lab 4b by Andrey</Title></HEAD>
<BODY>
<table border="1">
<TR><TD>FirstName</TD><TD>LastName</TD><TD>Company</TD><TD>Address</TD><TD>City</TD><TD>County</TD><TD>State</TD><TD>ZIP</TD><TD>Phone</TD><TD>Fax</TD><TD>Email</TD><TD>Web</TD></TR>
<TR><TD>Essie</TD><TD>Vaill</TD><TD>Litronic Industries</TD><TD>14225 Hancock Dr</TD><TD>Anchorage</TD><TD>Anchorage</TD><TD>AK</TD><TD>99515</TD><TD>907-345-0962</TD><TD>907-345-1215</TD><TD>essie#vaill.com</TD><TD>http://www.essievaill.com</TD><TR>
<TR><TD>Cruz</TD><TD>Roudabush</TD><TD>Meridian Products</TD><TD>2202 S Central Ave</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85004</TD><TD>602-252-4827</TD><TD>602-252-4009</TD><TD>cruz#roudabush.com</TD><TD>http://www.cruzroudabush.com</TD><TR>
<TR><TD>Billie</TD><TD>Tinnes</TD><TD>D & M Plywood Inc</TD><TD>28 W 27th St</TD><TD>New York</TD><TD>New York</TD><TD>NY</TD><TD>10001</TD><TD>212-889-5775</TD><TD>212-889-5764</TD><TD>billie#tinnes.com</TD><TD>http://www.billietinnes.com</TD><TR>
<TR><TD>Zackary</TD><TD>Mockus</TD><TD>Metropolitan Elevator Co</TD><TD>286 State St</TD><TD>Perth Amboy</TD><TD>Middlesex</TD><TD>NJ</TD><TD>08861</TD><TD>732-442-0638</TD><TD>732-442-5218</TD><TD>zackary#mockus.com</TD><TD>http://www.zackarymockus.com</TD><TR>
<TR><TD>Rosemarie</TD><TD>Fifield</TD><TD>Technology Services</TD><TD>3131 N Nimitz Hwy #-105</TD><TD>Honolulu</TD><TD>Honolulu</TD><TD>HI</TD><TD>96819</TD><TD>808-836-8966</TD><TD>808-836-6008</TD><TD>rosemarie#fifield.com</TD><TD>http://www.rosemariefifield.com<$
<TR><TD>Bernard</TD><TD>Laboy</TD><TD>Century 21 Keewaydin Prop</TD><TD>22661 S Frontage Rd</TD><TD>Channahon</TD><TD>Will</TD><TD>IL</TD><TD>60410</TD><TD>815-467-0487</TD><TD>815-467-1244</TD><TD>bernard#laboy.com</TD><TD>http://www.bernardlaboy.com</TD><TR>
<TR><TD>Sue</TD><TD>Haakinson</TD><TD>Kim Peacock Beringhause</TD><TD>9617 N Metro Pky W</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85051</TD><TD>602-953-2753</TD><TD>602-953-0355</TD><TD>sue#haakinson.com</TD><TD>http://www.suehaakinson.com</TD><TR>
<TR><TD>Valerie</TD><TD>Pou</TD><TD>Sea Port Record One Stop Inc</TD><TD>7475 Hamilton Blvd</TD><TD>Trexlertown</TD><TD>Lehigh</TD><TD>PA</TD><TD>18087</TD><TD>610-395-8743</TD><TD>610-395-6995</TD><TD>valerie#pou.com</TD><TD>http://www.valeriepou.com</TD><TR>
<TR><TD>Lashawn</TD><TD>Hasty</TD><TD>Kpff Consulting Engineers</TD><TD>815 S Glendora Ave</TD><TD>West Covina</TD><TD>Los Angeles</TD><TD>CA</TD><TD>91790</TD><TD>626-960-6738</TD><TD>626-960-1503</TD><TD>lashawn#hasty.com</TD><TD>http://www.lashawnhasty.com</TD><T$
<TR><TD>Marianne</TD><TD>Earman</TD><TD>Albers Technologies Corp</TD><TD>6220 S Orange Blossom Trl</TD><TD>Orlando</TD><TD>Orange</TD><TD>FL</TD><TD>32809</TD><TD>407-857-0431</TD><TD>407-857-2506</TD><TD>marianne#earman.com</TD><TD>http://www.marianneearman.com</TD$
<TR><TD>Justina</TD><TD>Dragaj</TD><TD>Uchner David D Esq</TD><TD>2552 Poplar Ave</TD><TD>Memphis</TD><TD>Shelby</TD><TD>TN</TD><TD>38112</TD><TD>901-327-5336</TD><TD>901-327-2911</TD><TD>justina#dragaj.com</TD><TD>http://www.justinadragaj.com</TD><TR>
<TR><TD>Mandy</TD><TD>Mcdonnell</TD><TD>Southern Vermont Surveys</TD><TD>343 Bush St Se</TD><TD>Salem</TD><TD>Marion</TD><TD>OR</TD><TD>97302</TD><TD>503-371-8219</TD><TD>503-371-1118</TD><TD>mandy#mcdonnell.com</TD><TD>http://www.mandymcdonnell.com</TD><TR>
<TR><TD>Conrad</TD><TD>Lanfear</TD><TD>Kahler Karen T Esq</TD><TD>49 Roche Way</TD><TD>Youngstown</TD><TD>Mahoning</TD><TD>OH</TD><TD>44512</TD><TD>330-758-0314</TD><TD>330-758-3536</TD><TD>conrad#lanfear.com</TD><TD>http://www.conradlanfear.com</TD><TR>
<TR><TD>Cyril</TD><TD>Behen</TD><TD>National Paper & Envelope Corp</TD><TD>1650 S Harbor Blvd</TD><TD>Anaheim</TD><TD>Orange</TD><TD>CA</TD><TD>92802</TD><TD>714-772-5050</TD><TD>714-772-3859</TD><TD>cyril#behen.com</TD><TD>http://www.cyrilbehen.com</TD><TR>
<TR><TD>Shelley</TD><TD>Groden</TD><TD>Norton Robert L Esq</TD><TD>110 Broadway St</TD><TD>San Antonio</TD><TD>Bexar</TD><TD>TX</TD><TD>78205</TD><TD>210-229-3017</TD><TD>210-229-9757</TD><TD>shelley#groden.com</TD><TD>http://www.shelleygroden.com</TD><TR>
</table>
</BODY>
</HTML>
So, I was messing around and I tired to create a few sed commands that would mimic the second output.
My first attempt was:
#!/bin/sh
sed -e 's=^.*$=<TR><TD>&</TD></TR>=' input.csv
Unfortunately, this program only outputs something like this where I get TR TD at the beginning and end, but no TD tags inside:
<TR><TD>"Bryan","Rovell","All N All Shop","90 Hackensack St","East Rutherford","Bergen","NJ","07073","201-939-2788","201-939-9079","bryan#rovell.com","http://www.bryanrovell.com"</TD></TR>
<TR><TD>"Joey","Bolick","Utility Trailer Sales","7700 N Council Rd","Oklahoma City","Oklahoma","OK","73132","405-728-5972","405-728-5244","joey#bolick.com","http://www.joeybolick.com"</TD></TR>
I've also attempted to create individual seds to tag field, but instead I've only managed to tag each word, so I'm kinda stuck.
I'm partially on the right track, I think, but I need helping indenting and adding TD to the beginning & end of every field, along with TR to the beginning and end of each new column.
This is the main part of it:
$ sed -r 's:^"?: <TR><TD>:; s:"?,"?:</TD><TD>:g; s:"?$:</TD></TR>:' file
<TR><TD>FirstName</TD><TD>LastName</TD><TD>Company</TD><TD>Address</TD><TD>City</TD><TD>County</TD><TD>State</TD><TD>ZIP</TD><TD>Phone</TD><TD>Fax</TD><TD>Email</TD><TD>Web</TD></TR>
<TR><TD>Essie</TD><TD>Vaill</TD><TD>Litronic Industries</TD><TD>14225 Hancock Dr</TD><TD>Anchorage</TD><TD>Anchorage</TD><TD>AK</TD><TD>99515</TD><TD>907-345-0962</TD><TD>907-345-1215</TD><TD>essie#vaill.com</TD><TD>http://www.essievaill.com</TD></TR>
<TR><TD>Cruz</TD><TD>Roudabush</TD><TD>Meridian Products</TD><TD>2202 S Central Ave</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85004</TD><TD>602-252-4827</TD><TD>602-252-4009</TD><TD>cruz#roudabush.com</TD><TD>http://www.cruzroudabush.com</TD></TR>
<TR><TD>Billie</TD><TD>Tinnes</TD><TD>D & M Plywood Inc</TD><TD>28 W 27th St</TD><TD>New York</TD><TD>New York</TD><TD>NY</TD><TD>10001</TD><TD>212-889-5775</TD><TD>212-889-5764</TD><TD>billie#tinnes.com</TD><TD>http://www.billietinnes.com</TD></TR>
<TR><TD>Zackary</TD><TD>Mockus</TD><TD>Metropolitan Elevator Co</TD><TD>286 State St</TD><TD>Perth Amboy</TD><TD>Middlesex</TD><TD>NJ</TD><TD>08861</TD><TD>732-442-0638</TD><TD>732-442-5218</TD><TD>zackary#mockus.com</TD><TD>http://www.zackarymockus.com</TD></TR>
<TR><TD>Rosemarie</TD><TD>Fifield</TD><TD>Technology Services</TD><TD>3131 N Nimitz Hwy #-105</TD><TD>Honolulu</TD><TD>Honolulu</TD><TD>HI</TD><TD>96819</TD><TD>808-836-8966</TD><TD>808-836-6008</TD><TD>rosemarie#fifield.com</TD><TD>http://www.rosemariefifield.com</TD></TR>
<TR><TD>Bernard</TD><TD>Laboy</TD><TD>Century 21 Keewaydin Prop</TD><TD>22661 S Frontage Rd</TD><TD>Channahon</TD><TD>Will</TD><TD>IL</TD><TD>60410</TD><TD>815-467-0487</TD><TD>815-467-1244</TD><TD>bernard#laboy.com</TD><TD>http://www.bernardlaboy.com</TD></TR>
<TR><TD>Sue</TD><TD>Haakinson</TD><TD>Kim Peacock Beringhause</TD><TD>9617 N Metro Pky W</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85051</TD><TD>602-953-2753</TD><TD>602-953-0355</TD><TD>sue#haakinson.com</TD><TD>http://www.suehaakinson.com</TD></TR>
<TR><TD>Valerie</TD><TD>Pou</TD><TD>Sea Port Record One Stop Inc</TD><TD>7475 Hamilton Blvd</TD><TD>Trexlertown</TD><TD>Lehigh</TD><TD>PA</TD><TD>18087</TD><TD>610-395-8743</TD><TD>610-395-6995</TD><TD>valerie#pou.com</TD><TD>http://www.valeriepou.com</TD></TR>
<TR><TD>Lashawn</TD><TD>Hasty</TD><TD>Kpff Consulting Engineers</TD><TD>815 S Glendora Ave</TD><TD>West Covina</TD><TD>Los Angeles</TD><TD>CA</TD><TD>91790</TD><TD>626-960-6738</TD><TD>626-960-1503</TD><TD>lashawn#hasty.com</TD><TD>http://www.lashawnhasty.com</TD></TR>
<TR><TD>Marianne</TD><TD>Earman</TD><TD>Albers Technologies Corp</TD><TD>6220 S Orange Blossom Trl</TD><TD>Orlando</TD><TD>Orange</TD><TD>FL</TD><TD>32809</TD><TD>407-857-0431</TD><TD>407-857-2506</TD><TD>marianne#earman.com</TD><TD>http://www.marianneearman.com</TD></TR>
<TR><TD>Justina</TD><TD>Dragaj</TD><TD>Uchner</TD><TD> David D Esq</TD><TD>2552 Poplar Ave</TD><TD>Memphis</TD><TD>Shelby</TD><TD>TN</TD><TD>38112</TD><TD>901-327-5336</TD><TD>901-327-2911</TD><TD>justina#dragaj.com</TD><TD>http://www.justinadragaj.com</TD></TR>
<TR><TD>Mandy</TD><TD>Mcdonnell</TD><TD>Southern Vermont Surveys</TD><TD>343 Bush St Se</TD><TD>Salem</TD><TD>Marion</TD><TD>OR</TD><TD>97302</TD><TD>503-371-8219</TD><TD>503-371-1118</TD><TD>mandy#mcdonnell.com</TD><TD>http://www.mandymcdonnell.com</TD></TR>
<TR><TD>Conrad</TD><TD>Lanfear</TD><TD>Kahler</TD><TD> Karen T Esq</TD><TD>49 Roche Way</TD><TD>Youngstown</TD><TD>Mahoning</TD><TD>OH</TD><TD>44512</TD><TD>330-758-0314</TD><TD>330-758-3536</TD><TD>conrad#lanfear.com</TD><TD>http://www.conradlanfear.com</TD></TR>
<TR><TD>Cyril</TD><TD>Behen</TD><TD>National Paper & Envelope Corp</TD><TD>1650 S Harbor Blvd</TD><TD>Anaheim</TD><TD>Orange</TD><TD>CA</TD><TD>92802</TD><TD>714-772-5050</TD><TD>714-772-3859</TD><TD>cyril#behen.com</TD><TD>http://www.cyrilbehen.com</TD></TR>
<TR><TD>Shelley</TD><TD>Groden</TD><TD>Norton</TD><TD> Robert L Esq</TD><TD>110 Broadway St</TD><TD>San Antonio</TD><TD>Bexar</TD><TD>TX</TD><TD>78205</TD><TD>210-229-3017</TD><TD>210-229-9757</TD><TD>shelley#groden.com</TD><TD>http://www.shelleygroden.com</TD></TR>
I expect you can figure out the rest since that's just printing the head and tail lines.
I'm having trouble using grep to through some html code.
I'm trying to find similar strings to this
<td>product description here</td><td> $<font color='red'>0.25</font>
i'm trying to generalize formula to count each line that is under $0.25 the parts that will vary are the:
href='/go/12229' the number after /go/ will change but always be a number 5 digits long
the product description can be alphanumeric with spaces and special characters
and the price can be anything from 0.01 to 0.25
I've tried making formulas like the one below but it either does not work or returns nothing.
grep -c "href='/go/'[*] target="_blank" rel="nofollow">*</a></td><td> $<font color='red'>[0].[0-2][0-9]</font>"
I think it has to do with me not escaping special characters correctly, but i'm not sure.
Any help is appreciated.
Okay - this requires that each line be formated as in your example, but this should give you the link, description and prices where each line is between 0.01 and 0.25. The the contents of this code an put them in a file like "priceawk" and make it executable:
grep 'go\/[0-9]\{5\}' | awk -F"<" '
{
split( $7, price_arr, ">" )
if( price_arr[ 2 ] > 0.00 && price_arr[ 2 ] < 0.26 )
{
split( $3, link_arr, "'\''" )
split( link_arr[ 3 ], desc_arr, ">" )
printf( "%s %s %s\n", link_arr[ 2 ], desc_arr[ 2 ], price_arr[ 2 ] )
}
} '
Then use it like:
cat input | priceawk
With a test input file I made from your line, I get the following kinds of output:
/go/12229 product description here 0.25
/go/13455 find this line2 0.01
/go/12334 find this line3 0.23
/go/34455 find this line4 0.16
The printf() can be improved to give your output in a different form, with a more useful delimiter than the current space.
I am using tesseract on the mac with swedish trained data, http://code.google.com/p/tesseract-ocr/downloads/detail?name=swe.traineddata.gz
I run the following from the command line:
sudo tesseract -l swe my.png my.txt
and this is what it outputs in my.txt:
uavum-rn om: mgm.
:mm om names N............
m fw.
<>..,...,.....1,». mm. ^V.m..»...1 W u
|............................ mmm
m«.......
n....... ~.«......«y.= mm
Am...
M-Q-..y...#»~.U.M........»...........
.;.§............. MYM... WU..
M. www
.<W..L.....w.m.,w»
mm... Hm... ^......... a.....ß..... M
M..
Hm... 3....
>«........
N
1
G
n.......
mmm
mmm »
mmm
MW:-u >«..«.......
M.».....«#>-ms... .a »mm »1
mm... nu .<....-...WMA _..
m........m mm
WW» m
mm w
.-...............u.
|-...M-11.”.
|........m :>...1.1-1»-.N
Kwwm
M...-«
|.~.»...:-u1.«..... ,-...........
mm M
.-M».....m ...A m...m..<....ß.-.W
.mwwm .M M»-..U..........k
.....-W... .W-;-1
Is there some parameter I miss, one I am doing wrong?
Thanks.
It's been a while since I played with this engine but your note rings a bell with me so I checked the site. I think you need to use this file Swedish language data for Tesseract 3.02 because I suspect you're using the training data from the previous release.
If I got the root of your problem, mark my answer will you? ;)