How can I extract a specific table from a website in Perl?

I am trying to fetch all tables from the website http://finance.yahoo.com/etf/lists/?bypass=true&mod_id=mediaquotesetf&tab=tab1&scol=imkt&stype=desc&rcnt=50&page=1 using the Perl module HTML::TableExtract, but I can't get the desired table; instead, I get only the first two tables, which are useless to me.
Here is my code:
#!/usr/bin/perl
#!perl -w
use DBI;
use strict;
use WWW::Mechanize;
use HTML::TableExtract;
my $mech= WWW::Mechanize->new();
my $url= 'http://finance.yahoo.com/etf/lists/?bypass=true&mod_id=mediaquotesetf&tab=tab1&scol=imkt&stype=desc&rcnt=50&page=1';
$mech -> get($url);
chomp(my $script = $mech -> content);
my $table=new HTML::TableExtract();
$table->parse($script);
foreach my $ts($table->tables){
print "Table (", join(',', $ts->coords), "):\n";
foreach my $row ($ts->rows){
print join(',', @$row), "\n";
}
}
output:
Table (0,0):
,Search FinanceSearch Web
Table (0,1):
Quotes you view appear here for quick access.
Like this, I only get the first two tables instead of all of them.

The third table is generated dynamically using JavaScript. WWW::Mechanize doesn't support JavaScript, so you will need to use WWW::Mechanize::Firefox instead.
Note that this requires you to install the Firefox web browser and its MozRepl plugin, as well as the MozRepl Perl module.
use strict;
use warnings 'all';
use feature 'say';
use open qw/ :std :encoding(UTF-8) /;
use WWW::Mechanize::Firefox;
use HTML::TableExtract;
use constant URL => 'http://finance.yahoo.com/etf/lists/?bypass=true&mod_id=mediaquotesetf&tab=tab1&scol=imkt&stype=desc&rcnt=50&page=1';
my $mech = WWW::Mechanize::Firefox->new;
$mech->autoclose_tab(0);
$mech->get(URL);
my $te = HTML::TableExtract->new(depth => 0, count => 2);
$te->parse( $mech->content );
for my $row ( $te->rows ) {
local $" = ',';
print "@$row\n";
}
output
ETF Name,Ticker,Category,Fund Family,Intraday Return,3-MO Return,YTD Return,1-YR Return,3-YR Return,5-YR Return
UBS ETRACS ISE Exclusively Hmbldrs ETN,HOMX,Consumer Cyclical,UBS Group AG,+13.43%,-6.74%,-6.74%,-19.54%,0.0%,0.0%
VelocityShares 3x Long Natural Gas ETN,UGAZ,Trading-Leveraged Commodities,Credit Suisse AG,+9.33%,-60.15%,-60.15%,-91.36%,-81.58%,0.0%
ProShares Ultra Bloomberg Natural Gas,BOIL,Trading-Leveraged Commodities,ProShares,+5.96%,-41.13%,-41.13%,-76.12%,-62.1%,0.0%
Direxion Daily Brazil Bull 3X ETF,BRZU,Trading-Leveraged Equity,Direxion Funds,+4.24%,+66.64%,+66.64%,-61.7%,0.0%,0.0%
DB Commodity Double Long ETN,DYY,Trading-Leveraged Commodities,Deutsche Bank AG,+4.16%,-25.87%,-25.87%,-41.98%,-34.91%,-32.8%
Deutsche X-trackers MSCI EMktsHiDvYdHgEq,HDEE,Diversified Emerging Mkts,Deutsche Asset Management,+3.73%,-2.15%,-2.15%,0.0%,0.0%,0.0%
DB Agriculture Double Long ETN,DAG,Trading-Leveraged Commodities,Deutsche Bank AG,+3.57%,+3.12%,+3.12%,-19.14%,-29.52%,-25.49%
United States Natural Gas,UNG,Commodities Energy,United States Commodity Funds LLC,+3.15%,-23.18%,-23.18%,-49.7%,-32.73%,-32.09%
Direxion Daily Jr Gld Mnrs Bear 3X ETF,JDST,Trading-Inverse Equity,Direxion Funds,+3.03%,-80.83%,-80.83%,-88.33%,0.0%,0.0%
Direxion Daily S&P Biotech Bull 3X ETF,LABU,Trading-Leveraged Equity,Direxion Funds,+2.97%,-67.51%,-67.51%,0.0%,0.0%,0.0%
VelocityShares 3x Inverse Silver ETN,DSLV,Trading-Inverse Commodities,Credit Suisse AG,+2.88%,-36.11%,-36.11%,-15.03%,+13.64%,0.0%
ProShares Ultra MSCI Brazil Capped,UBR,Trading-Leveraged Equity,ProShares,+2.85%,+51.79%,+51.79%,-37.19%,-42.04%,-38.48%
Direxion Daily India Bull 3X ETF,INDL,Trading-Leveraged Equity,Direxion Funds,+2.71%,-9.82%,-9.82%,-48.84%,-13.11%,-22.69%
Direxion Daily Real Estate Bull 3X ETF,DRN,Trading-Leveraged Equity,Direxion Funds,+2.66%,+15.65%,+15.65%,-0.71%,+21.8%,+22.37%
iShares US Telecommunications,IYZ,Communications,iShares,+2.63%,+7.43%,+7.43%,+3.79%,+10.74%,+7.88%
ProShares Ultra Semiconductors,USD,Trading-Leveraged Equity,ProShares,+2.58%,-4.09%,-4.09%,-6.65%,+31.6%,+15.12%
Direxion Daily Pharmctcl&Medcl Bl 2X ETF,PILL,Trading-Leveraged Equity,Direxion Funds,+2.57%,-27.81%,-27.81%,0.0%,0.0%,0.0%
IQ Hedge Event-Driven Tracker ETF,QED,Market Neutral,IndexIQ,+2.54%,+1.51%,+1.51%,-1.64%,0.0%,0.0%
Direxion Daily Regional Bnks Bull 3X ETF,DPST,Trading-Leveraged Equity,Direxion Funds,+2.51%,-20.73%,-20.73%,0.0%,0.0%,0.0%
VelocityShares 3x Inverse Gold ETN,DGLD,Trading-Inverse Commodities,Credit Suisse AG,+2.44%,-40.19%,-40.19%,-24.19%,+6.6%,0.0%
Direxion Daily South Korea Bull 3X ETF,KORU,Trading-Leveraged Equity,Direxion Funds,+2.43%,+12.48%,+12.48%,-30.37%,0.0%,0.0%
VelocityShares Daily Inverse VIX ST ETN,XIV,Volatility,Credit Suisse AG,+2.43%,+0.31%,+0.31%,-25.29%,+3.55%,+13.26%
ProShares Ultra S&P Regional Banking,KRU,Trading-Leveraged Equity,ProShares,+2.42%,-22.89%,-22.89%,-17.41%,+10.48%,+9.61%
Global X FTSE Andean 40 ETF,AND,Latin America Stock,Global X Funds,+2.38%,+12.8%,+12.8%,-15.28%,-19.56%,-11.0%
AccuShares Spot CBOE® VIX® ETC Down,VXDN,Volatility,AccuShares™,+2.32%,-12.72%,-12.72%,0.0%,0.0%,0.0%
ProShares Short S&P Regional Banking,KRS,Trading-Inverse Equity,ProShares,+2.3%,+7.46%,+7.46%,-0.5%,-12.4%,-14.52%
United States 12 Month Natural Gas,UNL,Commodities Energy,United States Commodity Funds LLC,+2.3%,-8.84%,-8.84%,-29.88%,-22.73%,-23.91%
ProShares Short VIX Short-Term Futures,SVXY,Volatility,ProShares,+2.28%,+0.16%,+0.16%,-25.73%,+3.53%,0.0%
SPDR® S&P Transportation ETF,XTN,Industrials,SPDR State Street Global Advisors,+2.24%,+7.3%,+7.3%,-12.73%,+12.63%,+12.34%
iShares MSCI UAE Capped,UAE,Miscellaneous Region,iShares,+2.22%,+5.31%,+5.31%,-4.98%,0.0%,0.0%
VelocityShares 3x Inverse Crude Oil ETN,DWTI,Trading-Inverse Commodities,Credit Suisse AG,+2.21%,-20.19%,-20.19%,+19.62%,+56.72%,0.0%
ProShares Ultra High Yield,UJB,Trading-Leveraged Debt,ProShares,+2.19%,+10.09%,+10.09%,-6.81%,+1.36%,0.0%
iPath® Bloomberg Livestock SubTR ETN,COW,Commodities Agriculture,Barclays Funds,+2.19%,+0.94%,+0.94%,-10.57%,-3.1%,-5.92%
iPath® Bloomberg Natural Gas SubTR ETN,GAZ,Commodities Energy,Barclays Funds,+2.19%,-31.94%,-31.94%,-59.17%,-44.91%,-49.91%
Direxion Daily Small Cap Bull 3X ETF,TNA,Trading-Leveraged Equity,Direxion Funds,+2.16%,-8.7%,-8.7%,-35.41%,+10.14%,+6.17%
ProShares Ultra Utilities,UPW,Trading-Leveraged Equity,ProShares,+2.16%,+30.83%,+30.83%,+28.99%,+22.85%,+24.31%
SPDR® Wells Fargo Preferred Stock ETF,PSK,Preferred Stock,SPDR State Street Global Advisors,+2.14%,+1.77%,+1.77%,+5.83%,+5.83%,+6.08%
ProShares UltraShort Silver,ZSL,Trading-Inverse Commodities,ProShares,+2.14%,-23.44%,-23.44%,-1.99%,+21.56%,-2.97%
ProShares UltraPro Russell2000,URTY,Trading-Leveraged Equity,ProShares,+2.14%,-8.53%,-8.53%,-34.81%,+10.76%,+7.06%
Direxion Daily Emrg Mkts Bull 3X ETF,EDC,Trading-Leveraged Equity,Direxion Funds,+2.11%,+13.64%,+13.64%,-44.79%,-25.84%,-28.1%
DB Agriculture Short ETN,ADZ,Trading-Inverse Commodities,Deutsche Bank AG,+2.1%,-11.59%,-11.59%,-5.04%,+11.94%,+7.62%
DB 3x Long 25+ Year Treasury Bond ETN,LBND,Trading-Leveraged Debt,Deutsche Bank AG,+2.08%,+22.67%,+22.67%,+2.15%,+11.31%,+24.21%
PureFunds ISE Cyber Security™ ETF,HACK,Technology,Pure Funds,+2.02%,-7.45%,-7.45%,-14.3%,0.0%,0.0%
ProShares UltraPro MidCap400,UMDD,Trading-Leveraged Equity,ProShares,+1.95%,+6.04%,+6.04%,-19.63%,+20.32%,+16.2%
ProShares Ultra Telecommunications,LTL,Trading-Leveraged Equity,ProShares,+1.79%,+13.74%,+13.74%,+0.87%,+18.18%,+11.67%
Direxion Daily Hmbldrs&Supls Bull 3X ETF,NAIL,Trading-Leveraged Equity,Direxion Funds,+1.79%,-10.32%,-10.32%,0.0%,0.0%,0.0%
Teucrium Wheat ETF,WEAT,Commodities Agriculture,Teucrium,+1.78%,-1.54%,-1.54%,-17.67%,-21.21%,0.0%
Vanguard Telecommunication Services ETF,VOX,Communications,Vanguard,+1.76%,+11.11%,+11.11%,+11.82%,+11.64%,+9.99%
Direxion Daily Financial Bull 3X ETF,FAS,Trading-Leveraged Equity,Direxion Funds,+1.75%,-14.79%,-14.79%,-18.93%,+21.63%,+14.48%
US Global Jets ETF,JETS,Miscellaneous Sector,U.S. Global Investors,+1.75%,+1.77%,+1.77%,0.0%,0.0%,0.0%
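To illustrate why a plain fetch only finds the static tables, here is a minimal Python sketch. It is not part of the original answer: the two HTML snippets and the names TableCounter and count_tables are made up for illustration. It counts the table tags in markup as the server might send it, and in markup as it might look after a browser has run the page's script.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags in whatever markup is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html):
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical page source: two static tables and a script that would
# build a third table once a browser executes it.
STATIC_HTML = """
<html><body>
<table><tr><td>Search</td></tr></table>
<table><tr><td>Quotes</td></tr></table>
<div id="data"></div>
<script>
var t = document.createElement("table");
document.getElementById("data").appendChild(t);
</script>
</body></html>
"""

# Hypothetical DOM after the script has run in a real browser.
RENDERED_HTML = STATIC_HTML.replace(
    '<div id="data"></div>', '<div id="data"><table></table></div>'
)

print(count_tables(STATIC_HTML), count_tables(RENDERED_HTML))  # 2 3
```

A parser that never executes the script can only report the tables present in the raw markup, which is why a browser-backed tool is needed to see the third one.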

Related

Beautiful Soup - find_all function is returning only 20 items from the page. The actual results are around 250

I am using find_all in beautiful soup library to parse the HTML text.
code
from bs4 import BeautifulSoup
from requests import get
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
response = get(URL, headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
len(html_soup)
This returns only 20 items even though the page shows 250 results. What am I doing wrong here?
Try this (it retrieves all 291 results):
from selenium import webdriver
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
driver.get(URL)
driver.maximize_window()
PAUSE_TIME = 2
lh = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(PAUSE_TIME)
    nh = driver.execute_script("return document.body.scrollHeight")
    if nh == lh:
        break
    lh = nh
articles = driver.find_elements_by_css_selector('.css-h7k7mr')
for article in articles:
    print(article.text)
    print('-' * 80)
driver.close()
prints:
₹45.11 L
EMI starts at ₹28.13 K
3 BHK Apartment
Bachupally, Nizampet, Hyderabad
Build Up Area
1556 sq.ft
Avg. Price
₹2.90 K/sq.ft
Special Highlights
24x7 Security
Badminton Court
Cycling & Jogging Track
Gated Community
3 BHK Apartment available for sale in Bachapally,hyderabad,beside Mama Medical College, Nizampet, Hyderabad. Available amenities are: Gym, Swimming pool, Garden, Kids area, Sports facility, Lift. Apartment has 3 bedroom, 2 bathroom.
Read more
M Srikanth
Housing Prime Agent
Contact
--------------------------------------------------------------------------------
₹37.96 L - 62.05 L
EMI starts at ₹23.67 K
Bhuvanteza Evk Aura
Marketed by Sri Avani Infra Projects
Kollur, Hyderabad
Configurations
2, 3 BHK Apartments
Possession Starts
Nov, 2022
Avg. Price
₹3.65 K/sq.ft
Real estate developer Bhuvanteza Infrastructures has launched prime housing project Evk Aura in Kollur, Hyderabad. The project is offering beautiful and comfortable 2 and 3 BHK apartments for sale. Built-up area for 2 BHK apartments is in the range of 1040 to 1185 sq ft. and for 3 BHK apartments it is 1700 sq ft. Amenities which are required for a comfortable living will be available in the complex, they are car parking, club house, swimming pool, children play area, power backup and others. Developer Bhuvanteza Infrastructures can be contacted for owning an apartment in Evk Aura. Kollur is a ...
Read more
SA
Sri Avani Infra Projects
Seller
Contact
--------------------------------------------------------------------------------
and so on....
Note on Selenium: you need both selenium and geckodriver installed; in this code, geckodriver is loaded from c:/program/geckodriver.exe.
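The scrolling loop in the Selenium code boils down to "scroll, wait, and stop once the page height no longer changes". A browser-free sketch of that logic follows. The names scroll_until_stable and FakePage are hypothetical, and the simulated page, which loads 20 items per scroll up to 250, is an assumption made only so that the loop can run without a browser.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=0.0, max_rounds=100):
    """Scroll repeatedly until the reported height stops growing.

    Returns the number of scrolls performed. max_rounds is a safety cap
    for pages that never stop loading.
    """
    last_height = get_height()
    rounds = 0
    for _ in range(max_rounds):
        scroll_to_bottom()
        time.sleep(pause)          # give lazy-loaded content time to arrive
        rounds += 1
        new_height = get_height()
        if new_height == last_height:
            break                  # nothing new was loaded, so we are done
        last_height = new_height
    return rounds


class FakePage:
    """Stand-in for a browser page that lazy-loads items in batches."""

    def __init__(self, total=250, batch=20):
        self.total = total
        self.batch = batch
        self.loaded = batch        # the first batch is present on load

    def height(self):
        return self.loaded * 100   # pretend every item is 100px tall

    def scroll(self):
        self.loaded = min(self.total, self.loaded + self.batch)


page = FakePage()
scroll_until_stable(page.height, page.scroll)
print(page.loaded)  # 250
```

With a real driver, the two callables would wrap the execute_script calls used in the answer.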
You're misreading the page: there are 250 results in total, but only 20 are shown at a time, which is why you get 20 in Python.

'utf-8' codec can't decode byte 0xca in position 972: invalid continuation byte: using os, shutil, dictionary to move files

I have a question, and I don't know whether it can be done in Python.
I have files named in the form cik number-year.txt,
I have created directories whose names match the firm names,
and I have a spreadsheet that matches each name to its CIK.
I wrote a piece of code that should perform what I described, but I ran into an error which states UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 972: invalid continuation byte
Here is the CSV file that I'm trying to read:
company name ,cik,cik 2 ,missing ,notes
Deere,315189,,,
H.J. Heinz,46640,1637459,,
Bestfoods,25350,,,
Bayer Corporation,,,,
Sealed Air,1012100,,missing 1994-1997,
Eli Lilly,59478,,,
Campbell Soup,16732,,"missing 96, 97, 98, 99,00, 01",
Honeywell,48305,773840,overlapping years as have honeywell central and honeywell international,
Kellogg,55067,,,
Union Carbide,100790,,"missing 10-ks from 1993-1996, 2001 ",
Cooper Industries,24454,1141982,"missing 1993-2002, 2013-2018",
North American Philips,,,,
Intel,50863,,,
Amerada Hess,4447,,"missing 95, 97, 00",
Martin Marietta,916076,,1994-2002 ,
PPG Industries,79879,,"1997, 2000, 2001",
Litton Industries,59880,,"1995, 2001-2018",
Reynolds Metals,83604,,"95, 96, 97, 98, 00-18",
Warner-Lambert,104669,,"96, 97, 01-18",
Quaker Oats,81371,,,
Levi Strauss,94845,,before 2000,
Northrop Grumman,72945,,2002-2018,
Stone Container,94610,,,
LTV,,,,
American Cyanamid,4829,,,AMERICAN CYANAMID CO merged with American Home Products (AHP) in 1994.
Gillette,41499,,"96-98, 06-18","On October 1, 2005, Procter & Gamble finalized its merger with the Gillette Company."
Johnson Controls,53669,,"97, 98,99, 00, 01, 17-18 ",
Coca-Cola Enterprises,1491675,,"before 2011, after 2016 ",
BASF,,,,
Dana,26780,,95,
Champion International,19150,,nearly no data apart from 95,
Scott Paper,87949,,,The company was acquired by the Kimberly-Clark Corporation in 1995
Lyondell Chemical,842635,1489393,"95-98,09-18","LyondellBasell was formed in December 2007 by the acquisition of Lyondell Chemical Company by Basell Polyolefins for $12.7 billion.[7] As of 2016, Lyondell was the third largest independent chemical manufacturer in the United States.[8]"
Black & Decker,93556,12355,"00,01 ",2010 – Black & Decker merges with Stanley Works to become Stanley Black & Decker
Fort James,53117,,,"In 1997, the Fort Howard Paper Company and the James River Corporation merged to form the Fort James Corporation.[1][4] Fort Howard was headquartered in Green Bay and James River in Richmond,In 2000, the Fort James Corporation was acquired by Georgia-Pacific for $11 billion;[1][4] GP is based in Atlanta, Georgia. Virginia."
Mead,64394,,"missing 2001+, strange. ","missing 2005+, strange. "
Chiquita Brands Intl.,101063,,"98,01,06-18",
Dresser Industries,30099,,,"In 1998, Dresser merged with its main rival Halliburton,[1] Halliburton sold many of former Dresser non ""oil patch"" divisions, retaining the M W Kellogg Engineering and Construction Company and the Dresser oil-patch products and services that complemented Halliburton's energy and natural resource businesses. In 2001 Halliburton sold five separate, but somewhat related former Dresser non ""oil patch"" divisions, to an investment banking firm. Those five operations later took the name ""Dresser Inc."" In October 2010, Dresser Inc., was acquired by General Electric.[2] It is headquartered in Addison, Texas.[3]"
R.R. Donnelley & Sons,29669,,"95,97-02",
Tyson Foods,100493,,"95-99,01",
Compaq Computer,714154,,"96, 97","Struggling to keep up in the price wars against Dell, as well as with a risky acquisition of DEC,[4] Compaq was acquired for US$25 billion by HP in 2002"
J.E. Seagram,,,,Seagram was sold to French conglomerate Vivendi in 2000.
Rhone-Poulenc Rorer,217028,1325676,,In 1999 it merged with Hoechst AG to form Aventis.
Eaton,1490873,,,
Schering-Plough,310158,,"97, 98, 99, 00"," On November 4, 2009 Merck & Co. merged with Schering-Plough with the new company taking the name of Merck & Co."
Bethlehem Steel,11860,,,"After a decline in the American steel industry and other problems leading to the company's bankruptcy in 2001, the Bethlehem Steel Corporation was dissolved and the remaining assets sold to International Steel Group in 2003; Bethlehem Steel Corporation did not merge with/into International Steel Group."
FMC,37785,,"95, 2000,2001 ",
Navistar International,808450,,-,
VF,,,,
Avon Products,8868,,-,
American Standard,836102,,"95-97, 09-19 ",
Ingersoll-Rand,1466258,50485,43532,
Crown Holdings,1219601,,34001,
Cummins,26172,,,
Corning,24741,,1,
OfficeMax,929428,,"only have 97, 02, 03",
Pharmacia,12978,,"95, 96.00.01.02.14-19 ",
Owens-Illinois,812233,812074,,This may refer to OWENS & MINOR INC/VA/ or OWENS ILLINOIS INC /DE/
AMAX,,,,
Times Mirror,98349,925260,95. 99,Times Mirror Co. was acquired by the Tribune Company in 2000
Sun Microsystems,709519,,"95, 00, 01, 10-19","On April 20, 2009, it was announced that Oracle Corporation would acquire Sun for US$7.4 billion. The deal was completed on January 27, 2010.[3]"
Masco,62996,,,
Grumman,1133421,,34001,
Ryerson Tull,790528,1013595,"98-02, 15-18",
Gannett,39899,,,"In 2015, Gannett Co., Inc., spun off its publishing business into a separate publicly traded entity, while retaining the internet media divisions. Immediately following the spin off, the former parent Company (Gannett Co., Inc.) renamed itself Tegna and owns approximately 50 TV stations. The spun-off publishing business renamed itself ""Gannett"""
Pitney Bowes,78814,,"00,02 ",
Farmland Industries,34616,,, sold all of its assets in 2002–04
FINA,,,,
Kerr-McGee,55458,1141185,"95.96.01,07-19",
AMP,1242513,,"before 2009 and after 2014, 2012",
Agway,2852,,only have 94 and 02," On October 1, 2002 the company filed for Chapter 11 bankruptcy"
Air Products & Chem.,2969,,95-00,
Hershey Foods,47111,,,
Varity,63118,,only have 94 and 96,"In March 1999, LucasVarity was purchased by US automotive company TRW.[7]"
Rohm & Haas,84792,,43757, Dow Chemical Company bought Rohm and Haas for $15 billion in 2009
Tyco International,833444,,94-96,"On January 25, 2016, Johnson Controls announced it would merge with Tyco, and all businesses of Tyco and Johnson Controls would be combined under Tyco International plc, to be renamed as Johnson Controls International plc."
Union Camp,100783,,96,In 1999 it was acquired by International Paper.
Harris,202058,,"96, 97,01",
Maytag,63541,,01-02;07-19 ,The Maytag Corporation is an American home and commercial appliance brand owned by Whirlpool Corporation after the April 2006 acquisition of Maytag.
Berkshire Hathaway,109694,,99-2018,
Smurfit-Stone Container,94610,727742,,"SSCC was formed in November 1998, with the merger of Jefferson Smurfit Corporation (JSC) and Stone Container Corporation (Stone).I have also included the Smurfit Corporation cik here "
Universal,102037,,"95,97,98,00,01",
Ethyl,33656,,1,"In 2004, Ethyl Corporation became a subsidiary of NewMarket Corporation (NYSE: NEU)."
Premark International,800575,,missing 00-19 ,
Teledyne,1094285,,missing before 2002,
Seagate Technology,1137789,354952,,
Loral,1029850,1006269,,
Hercules,1280784,46989,," 2008, when it was merged into Ashland Inc."
Owens Corning,75234,1370946,,
Illinois Tool Works,49826,,,
Hormel Foods,48465,,,
PerkinElmer,31791,77551,,
Paccar,75362,,,
Sherwin-Williams,89800,,,
Pennzoil,77320,,"only have 94, 97, 98 ",
Temple-Inland,731939,,,
Readers Digest Assn.,858558,,,
Mapco,62142,,,
Avery Dennison,8818,,,
Diamond Shamrock,810316,,,
Ultramar Diamond Shamrock,887207,,,
Phelps Dodge,78066,,,
Land OLakes,1032562,,,
AMDAHL,4427,,, been a wholly owned subsidiary of Fujitsu since 1997.
Armstrong Holdings,1109304,,,
Baker Hughes,808362,1701605,,
Hasbro,46080,,,
Goodrich,42542,,,
Ball,9389,,,
Engelhard,352947,,,
Total Petroleum,,,,
Whitman,49573,1084230,,
Olin,74303,,,
Parker Hannifin,76334,,,
National Steel,70578,1231868,,
McDermott,708819,,,
Willamette Industries,107189,,,"In 2002, the lumber and paper company was purchased by competitor Weyerhaeuser of Federal Way, Washington in a hostile buyout and merged into Weyerhaeuser's existing operations."
Becton Dickinson,10795,,,
Westvaco,106498,1159297,,
Knight-Ridder,205520,,,"bought by McClatchy on June 27, 2006"
Quantum Chemical,,,,
Dean Foods,931336,,,
Dover,29905,,,
Intl. Multifoods,51410,,,"*cant find M&A records on wiki, 2005 onwards data missing"
Conner Peripherals,792397,,,"In 1996, Conner Peripherals was acquired by Seagate."
Premcor,1159119,,,Premcor was acquired by Valero in 2005. 
Maxxam,63814,,,
Manville,355473,,,
Brunswick,14930,,,
Collins & Aikman,1037123,846815,,
Stanley Works,93556,,,
Louisiana-Pacific,60519,,,
Polaroid,79326,1227728,," Polaroid Corporation was declared bankrupt in 2001, its brand and assets were sold off. The ""new"" Polaroid formed as a result, and itself declared bankruptcy in 2008, resulting in a further sale and in the present-day Polaroid Corporation"
Tosco,74091,,,"Tosco merged with Phillips Petroleum in 2001. Phillips merged with Conoco in 2002 to become ConocoPhillips, who spun off the Circle K stores to Canadian-based Alimentation Couche-Tard."
Tribune,726513,,,
E-SYSTEMS,,,,"In 1995, Raytheon Company acquired E-Systems, Inc."
ARMCO,7383,,,"In 1999, AK steel holding acquired Armco Inc., its former parent company, for $1.3 billio"
Burlington Industries Equity,870213,,,Its assets were acquired by International Textile Group (ITG) out of bankruptcy in late 2003
Tandem Computers,315180,,,
McGraw-Hill,64040,,,
Springs Industries,93102,,,"On June 27, 2007 Springs said that after 120 years, Springs would end manufacturing in South Carolina with the closing of its Grace and Close plants. The state would still have about 700 employees, most of them at distribution centers in Lancaster and Fort Lawn, and at the Fort Mill offices"
Molson Coors Brewing,24545,,,
Dow Corning,29917,,,"Following the December 11, 2015 announcement that it would merge with DuPont, "
York International,842662,,,"The York brand has been owned since August 2005 by Johnson Controls, when it was sold to them for $3.2 billion."
GenCorp,40888,,,
Asarco,7649,,,"In 1999 it was acquired by Grupo México, which had begun as Asarco's 49%-owned Mexican subsidiary in 1965."
Morton International,1035972,,,
Wang Laboratories,,,,"10-k available online, but somehow not in directory"
Central Soya,,,,
Arvin Industries,7636,,,
Pet,888455,,,missing 10-ks from 2007 onwards
Mattel,63276,,,
MID-AMERICA DAIRYMEN,789868,,,
Sequa,95301,,,
Fruit of the Loom,1053303,,,
Sonoco Products,91767,,,
Dow Jones,29924,,,2007 when an extended takeover battle saw News Corp take control of the company
Rubbermaid,814453,85627,,
Echlin,31348,,,"data available till 1996, no info traceable on internet"
USG,757011,,,
CENEX,823277,,,
New York Times,71691,,,
Shaw Industries,89498,,,"On January 4, 2001, under the guidance of CEO and President W. Norris Little, Sr. and CEO Bob Shaw, Shaw Industries was sold to Berkshire Hathaway Inc."
Witco,107889,,,
National Semiconductor,70530,,,
Imcera Group,,,,https://en.wikipedia.org/wiki/Mallinckrodt
Bausch & Lomb,10427,,,
Clorox,21076,,,
Sundstrand,95395,,,"Hamilton Sanstrand company was formed from the merger of Hamilton Standard and Sundstrand Corporation in 1999. In 2012, Hamilton Sundstrand was merged with Goodrich Corporation to form UTC Aerospace Systems. No evidence of Hamilton Sanstrand can be found in the directory "
Aeroquip-Vickers,59198,,,"On February 1, 1999, Eaton and Aeroquip-Vickers jointly announced that Eaton would acquire all of the outstanding common shares of Aeroquip-Vickers for $58 per share in cash, or approximately $1.7 billion."
Murphy Oil,717423,,,
Metaldyne,745448,1616817,,"Metaldyne Performance Group Inc.'s majority owner, American Securities LLC, and its affiliates acquired HHI Holdings in October 2012; Metaldyne, LLC in December 2012"
Burlington Resources,,,,
Freeport-McMoran,,,,
Cyprus Amax Minerals,,,,
Timken,,,,
National Service Industries,,,,
Harsco,,,,
General Signal,,,,
Nucor,,,,
Duracell International,,,,
Fleetwood Enterprises,,,,
Storage Technology,,,,
Newell Rubbermaid,,,,
Crown Central,,,,
American Greetings,,,,
Cabot,,,,
Lubrizol,,,,
Reliance Electric,,,,
Deluxe,,,,
Advanced Micro Devices,,,,
Lafarge,,,,
WestPoint Stevens,,,,
Great Lakes Chemical,,,,
Bowater,,,,
Nacco Industries,,,,
McCormick,,,,
Furniture Brands Intl.,,,,
Washington Post,,,,
Federal Paper Board,,,,
Hillenbrand Industries,,,,
Del Monte Foods,,,,
Lear,,,,
Joy Global,,,,
Nalco Chemical,,,,
Coltec Industries,,,,
Walter Industries,,,,
M.A. Hanna,,,,
Potlatch,,,,
Thiokol,,,,
Oryx Energy,,,,
Gold Kist Holdings,,,,
Crane,,,,
Wm. Wrigley Jr.,,,,
Great American Mgmt. & Inv.,,,,
Tektronix,,,,
Raychem,,,,
Dresser-Rand,,,,
Gerber Products,,,,
Varian Associates,,,,
Tecumseh Products,,,,
Rohr,,,,
My code:
import csv
import os
import shutil
with open('missingfiles1.csv', 'r') as f:
    os.chdir('/Users/lucy/Desktop/summer/datacollection/10-X_C_1993-2000/1994/QTR1')
    reader=csv.reader(f)
    company=()
    for row in reader:
        company[row[0]]={'cik':row[1]}
    for f in oslistdir():
        filename, filetype = os.path.splitext(f)
        fn = filename.split('-')
        if cik==fn[0]:
            os.path.join('/Users/lucy/Desktop/summer/Summarydatafile',company)
            shutil.move(f,os.path)
os.chdir('/Users/lucy/Desktop/summer/datacollection/missingfiles.csv')
I had the same issue, but I am using the pandas package.
import pandas as pd
.
.
.
.
df = pd.read_csv('C:/Users/melissa/Documents/APIlist.csv', header=0, encoding='unicode_escape')
Adding header=0 and setting encoding='unicode_escape' did the trick.
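For the csv-module version in the question, a rough sketch of the same idea is to pass an explicit encoding to open() and then build a CIK-to-company lookup before moving anything. This is a sketch under assumptions, not a confirmed fix: the question does not say how the spreadsheet was saved, so cp1252 is only a guessed fallback, and the function names read_rows and move_by_cik are hypothetical.

```python
import csv
import os
import shutil


def read_rows(path, encodings=("utf-8", "cp1252")):
    """Return (rows, encoding) using the first encoding that decodes the file.

    The fallback list is an assumption; use the file's real encoding if known.
    """
    last_error = None
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as handle:
                return list(csv.reader(handle)), enc
        except UnicodeDecodeError as err:
            last_error = err       # remember the failure, try the next one
    raise last_error


def move_by_cik(csv_path, source_dir, dest_root):
    """Move files named <cik>-<year>.txt into dest_root/<company name>/."""
    rows, _ = read_rows(csv_path)
    cik_to_company = {}
    for row in rows[1:]:           # skip the header row
        if not row:
            continue
        name = row[0].strip()
        for cik in row[1:3]:       # the "cik" and "cik 2" columns
            if cik.strip():
                cik_to_company[cik.strip()] = name
    moved = []
    for filename in sorted(os.listdir(source_dir)):
        stem, ext = os.path.splitext(filename)
        company = cik_to_company.get(stem.split("-")[0])
        if ext == ".txt" and company:
            target_dir = os.path.join(dest_root, company)
            os.makedirs(target_dir, exist_ok=True)
            shutil.move(os.path.join(source_dir, filename),
                        os.path.join(target_dir, filename))
            moved.append(filename)
    return moved
```

Calling move_by_cik with the CSV path, the folder holding the filings, and the folder holding the company directories would move each matching file. Files whose CIK is not in the sheet are left where they are.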

Trouble converting a fixed-width file into a csv

Sorry if this is a newbie question, but I couldn't find an answer to this particular question on Stack Overflow.
I have a (very large) fixed-width data file that looks like this:
simplefile.txt
ratno fdate ratname typecode country
12346 31/12/2010 HARTZ 4 UNITED STATES
12444 31/12/2010 CHRISTIE 5 UNITED STATES
12527 31/12/2010 HILL AIR 4 UNITED STATES
15000 31/12/2010 TOKUGAVA INC. 5 JAPAN
37700 31/12/2010 HARTLAND 1 UNITED KINGDOM
37700 31/12/2010 WILDER 1 UNITED STATES
18935 31/12/2010 FLOWERS FINAL SERVICES INC 5 UNITED STATES
37700 31/12/2010 MAPLE CORPORATION 1 CANADA
48614 31/12/2010 SERIAL MGMT L.P. 5 UNITED STATES
1373 31/12/2010 AMORE MGMT GROUP N A 1 UNITED STATES
I am trying to convert it into a csv file using the terminal (the file is too big for Excel) that would look like this:
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT L.P.,5,UNITED STATES
1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES
I dug a bit around on this site and found a possible solution that relies on the awk shell command:
awk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' "simpletestfile.txt"
However, when I execute the above command in the terminal, it inadvertently also inserts commas in all white spaces, inside the separate words of what is supposed to remain a single field. The result of the above execution is as follows:
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED,STATES
12444,31/12/2010,CHRISTIE,5,UNITED,STATES
12527,31/12/2010,HILL,AIR,4,UNITED,STATES
15000,31/12/2010,TOKUGAVA,INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED,KINGDOM
37700,31/12/2010,WILDER,1,UNITED,STATES
18935,31/12/2010,FLOWERS,FINAL,SERVICES,INC,5,UNITED,STATES
37700,31/12/2010,MAPLE,CORPORATION,1,CANADA
48614,31/12/2010,SERIAL,MGMT,L.P.,5,UNITED,STATES
1373,31/12/2010,AMORE,MGMT,GROUP,N,A,1,UNITED,STATES
How can I avoid inserting commas in white spaces outside of delineated fieldwidths? Thank you!
Your attempt was good, but it requires gawk (GNU awk) for the FIELDWIDTHS built-in variable. With gawk:
$ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' file
ratno, fdate, ratname , typecode, country
12346, 31/12/2010, HARTZ , 4 , UNITED STATES
12444, 31/12/2010, CHRISTIE , 5 , UNITED STATES
12527, 31/12/2010, HILL AIR , 4 , UNITED STATES
Assuming you don't want the extra spaces, you can do instead:
$ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{for (i=1; i<=NF; ++i) gsub(/^ *| *$/, "", $i)}1' file
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
If you don't have GNU awk, you can achieve the same results with:
$ awk -v fieldwidths="5 11 31 9 16" '
BEGIN { OFS=","; split(fieldwidths, widths) }
{
    rec = $0
    $0 = ""
    start = 1;
    for (i=1; i<=length(widths); ++i) {
        $i = substr(rec, start, widths[i])
        gsub(/^ *| *$/, "", $i)
        start += widths[i]
    }
}1' file
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
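The same slice-and-trim approach can be sketched in Python, where the csv module also quotes any field that happens to contain a comma. The widths are the ones from the question; the function name fixed_width_to_csv and the sample line are made up for illustration.

```python
import csv
import io

WIDTHS = [5, 11, 31, 9, 16]  # field widths taken from the question


def fixed_width_to_csv(lines, widths):
    """Slice each line at the given widths, trim the pieces, emit CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        fields = []
        start = 0
        for width in widths:
            fields.append(line[start:start + width].strip())
            start += width
        writer.writerow(fields)    # quotes a field only when it needs it
    return out.getvalue()


# Illustrative line, padded to the widths above.
sample = ("12527" + " 31/12/2010" + " HILL AIR".ljust(31)
          + " 4".ljust(9) + " UNITED STATES")
print(fixed_width_to_csv([sample], WIDTHS), end="")
# 12527,31/12/2010,HILL AIR,4,UNITED STATES
```

For a very large file, the open file object can be passed in place of the list and the rows written straight to an output file rather than collected in memory.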
perl is handy here:
perl -nE ' # read this bottom to top
say join ",",
map {s/^\s+|\s+$//g; $_} # trim leading/trailing whitespace
/^(.{5}) (.{10}) (.{30}) (.{8}) (.*)/ # extract the fields
' simplefile.txt
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT L.P.,5,UNITED STATES
1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES
Although, for proper CSV, we need to be a bit cautious about fields containing commas or quotes. If I was feeling less secure about the contents of the file, I'd use this map block:
map {s/^\s+|\s+$//g; s/"/""/g; qq("$_")}
which outputs
"ratno","fdate","ratname","typecode","country"
"12346","31/12/2010","HARTZ","4","UNITED STATES"
"12444","31/12/2010","CHRISTIE","5","UNITED STATES"
"12527","31/12/2010","HILL AIR","4","UNITED STATES"
"15000","31/12/2010","TOKUGAVA INC.","5","JAPAN"
"37700","31/12/2010","HARTLAND","1","UNITED KINGDOM"
"37700","31/12/2010","WILDER","1","UNITED STATES"
"18935","31/12/2010","FLOWERS FINAL SERVICES INC","5","UNITED STATES"
"37700","31/12/2010","MAPLE CORPORATION","1","CANADA"
"48614","31/12/2010","SERIAL MGMT L.P.","5","UNITED STATES"
"1373","31/12/2010","AMORE MGMT GROUP N A","1","UNITED STATES"

Merging a weird html-like txt file with an Excel file

I got two files which I'm supposed to merge (most likely using statistical software such as R or SPSS), one of them being a normal Excel table with 3 variables (names at the top of the columns). The second one, however, was sent to me in a format I haven't seen before, a large txt file with input per case (identified with the ID variable, which I would also use to merge with the Excel file) which looks like this:
<organizations>
<organization id="B0101">
<type1>E</type1>
<type2>v</type2>
<name>International Association for Official Statistics</name>
<acronym>IAOS</acronym>
<country_first_address>not known</country_first_address>
<city_first_address>not known</city_first_address>
<countries_in_which_members_located>not known</countries_in_which_members_located>
<subject_headings>Government; Statistics</subject_headings>
<foundation_year>1985</foundation_year>
<history>[[History]] Founded 1985, Amsterdam (Netherlands), at 45th Session of #A2590, as a specialized section of ISI. Absorbed, 1989, #D1316, which had been set up 22 Oct 1958, Geneva (Switzerland), following recommendations of ISI, as [International Association of Municipal Statisticians -- Association internationale de statisticiens municipaux]. </history>
<history_relations>#A2590; #D1316</history_relations>
<consultative_status>none known</consultative_status>
<igo_relations>none known</igo_relations>
<ngo_relations>#E1209; #M4975; #D1976; #E2125; #E3673; #D2578; #M0084</ngo_relations>
<member_organizations>none known</member_organizations>
</organization>
<organization id="B8500">
<type1>B</type1>
<type2>y</type2>
<name>World Blind Union</name>
<acronym>WBU</acronym>
<country_first_address>Canada</country_first_address>
<city_first_address>Toronto</city_first_address>
<countries_in_which_members_located>Algeria; Angola; Benin; Burkina Faso; Burundi; Cameroon; Cape Verde; Central African Rep; Chad; Congo Brazzaville; Congo DR; Côte d'Ivoire; Djibouti; Egypt; Equatorial Guinea; Eritrea; Ethiopia; Gabon; Gambia; Ghana; Guinea; Guinea-Bissau; Kenya; Lesotho; Liberia; Libyan AJ; Madagascar; Malawi; Mali; Mauritania; Mauritius; Morocco; Mozambique; Namibia; Niger; Nigeria; Rwanda; Sao Tomé-Principe; Senegal; Seychelles; Sierra Leone; Somalia; South Africa; South Sudan; Sudan; Swaziland; Tanzania UR; Togo; Tunisia; Uganda; Zambia; Zimbabwe; Anguilla; Antigua-Barbuda; Argentina; Bahamas; Barbados; Belize; Bolivia; Brazil; Canada; Chile; Colombia; Costa Rica; Cuba; Dominica; Dominican Rep; Ecuador; El Salvador; Grenada; Guatemala; Guyana; Haiti; Honduras; Jamaica; Martinique; Mexico; Montserrat; Nicaragua; Panama; Paraguay; Peru; St Kitts-Nevis; St Lucia; St Vincent-Grenadines; Trinidad-Tobago; Turks-Caicos; Uruguay; USA; Venezuela; Virgin Is UK; Afghanistan; Bahrain; Bangladesh; Brunei Darussalam; Cambodia; China; Hong Kong; India; Indonesia; Iraq; Israel; Japan; Jordan; Kazakhstan; Korea Rep; Kuwait; Kyrgyzstan; Laos; Lebanon; Macau; Malaysia; Mongolia; Myanmar; Nepal; Pakistan; Philippines; Qatar; Singapore; Sri Lanka; Syrian AR; Taiwan; Tajikistan; Thailand; Timor-Leste; Turkmenistan; United Arab Emirates; Uzbekistan; Vietnam; Yemen; Australia; Fiji; New Zealand; Tonga; Albania; Armenia; Austria; Azerbaijan; Belarus; Belgium; Bosnia-Herzegovina; Bulgaria; Croatia; Cyprus; Czech Rep; Denmark; Estonia; Finland; France; Georgia; Germany; Greece; Hungary; Iceland; Ireland; Italy; Latvia; Lithuania; Luxembourg; Macedonia; Malta; Moldova; Montenegro; Netherlands; Norway; Poland; Portugal; Romania; Russia; Serbia; Slovakia; Slovenia; Spain; Sweden; Switzerland; Turkey; UK; Ukraine;</countries_in_which_members_located>
<subject_headings>Blind, Visually Impaired</subject_headings>
<foundation_year>1984</foundation_year>
<history>[[History]] Founded 26 Oct 1984, Riyadh (Saudi Arabia), as one united world body composed of representatives of national associations of the blind and agencies serving the blind, successor body to both #B3499, set up 20 July 1951, Paris (France), and #B2024, formed in Aug 1964, New York NY (USA). Constitution adopted 26 Oct 1984; amended at: 3rd General Assembly, 2-6 Nov 1992, Cairo (Egypt); 26-30 Aug 1996, Toronto (Canada); 20-24 Nov 2000, Melbourne (Australia); 22-26 Nov 2004, Cape Town (South Africa); 18-22 Aug 2008, Geneva (Switzerland); 12-16 Nov 2012, Bangkok (Thailand). Registered in accordance with French law, 20 Dec 1984, Paris and again 20 Dec 2004, Paris. Incorporated in Canada as not-share-capital not-for-profit corporation, 16 Mar 2007. </history>
<history_relations>#B3499; #B2024</history_relations>
<consultative_status>#E3377; #B2183; #B3548; #B0971; #F3380; #B3635</consultative_status>
<igo_relations>#E7552; #F1393; #A3375; #B3408</igo_relations>
<ngo_relations>#E0409; #E6422; #J5215; #F5821; #C1224; #D5392; #F6792; #A1945; #B2314; #D1758; #F5810; #D1612; #J0357; #D1038; #G6537; #B2221; #B0094; #B3536; #D7556</ngo_relations>
<member_organizations>#F6063; #F4959; #J1979; #C1224; #B0094; #D5392; #A1945; #D2362; #F2936; #J4730; #F3167; #D8743; #F1898; #D0043; #G0853</member_organizations>
</organization>
Any help would be appreciated: what type of file is this, and how can I transform it into a manageable table?
I think your data is XML. I copied your sample data, pasted it into a blank file, and saved it as sample.xml. I made sure to add in a line with </organizations> at the very end (line 37 in your sample), to close off that tag.
Then I read it in with the XML package:
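library(XML)
# Parse the file into an XML tree and take its root node (<organizations>)
xmlfile <- xmlTreeParse(file = "sample.xml")
xmltop <- xmlRoot(xmlfile)
# For each <organization>, collect the text value of every child element
orgs <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
# Transpose so that each organization becomes one row
orgs_df <- data.frame(t(orgs), row.names = NULL)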
This returns a dataframe orgs_df with 2 obs. of 15 variables. I presume you can now go ahead and merge this with your Excel file as you please.

Text manipulation with sed

I need a little help. In our class we've been playing around with grep and sed commands to learn how they work. More specifically, we've been using sed commands to manipulate text and add tags.
We were given an assignment: we have 500 lines of fake CSV data, and our job is to create a sed command that will automatically tag the data, and (theoretically) any new data added down the road.
Here are a few lines of our fake, untagged data, exactly as we received it. As you can see, each record starts with a first name and ends with an email and a web address:
FirstName,LastName,Company,Address,City,County,State,ZIP,Phone,Fax,Email,Web
"Essie","Vaill","Litronic Industries","14225 Hancock Dr","Anchorage","Anchorage","AK","99515","907-345-0962","907-345-1215","essie#vaill.com","http://www.essievaill.com"
"Cruz","Roudabush","Meridian Products","2202 S Central Ave","Phoenix","Maricopa","AZ","85004","602-252-4827","602-252-4009","cruz#roudabush.com","http://www.cruzroudabush.com"
"Billie","Tinnes","D & M Plywood Inc","28 W 27th St","New York","New York","NY","10001","212-889-5775","212-889-5764","billie#tinnes.com","http://www.billietinnes.com"
"Zackary","Mockus","Metropolitan Elevator Co","286 State St","Perth Amboy","Middlesex","NJ","08861","732-442-0638","732-442-5218","zackary#mockus.com","http://www.zackarymockus.com"
"Rosemarie","Fifield","Technology Services","3131 N Nimitz Hwy #-105","Honolulu","Honolulu","HI","96819","808-836-8966","808-836-6008","rosemarie#fifield.com","http://www.rosemariefifield.com"
"Bernard","Laboy","Century 21 Keewaydin Prop","22661 S Frontage Rd","Channahon","Will","IL","60410","815-467-0487","815-467-1244","bernard#laboy.com","http://www.bernardlaboy.com"
"Sue","Haakinson","Kim Peacock Beringhause","9617 N Metro Pky W","Phoenix","Maricopa","AZ","85051","602-953-2753","602-953-0355","sue#haakinson.com","http://www.suehaakinson.com"
"Valerie","Pou","Sea Port Record One Stop Inc","7475 Hamilton Blvd","Trexlertown","Lehigh","PA","18087","610-395-8743","610-395-6995","valerie#pou.com","http://www.valeriepou.com"
"Lashawn","Hasty","Kpff Consulting Engineers","815 S Glendora Ave","West Covina","Los Angeles","CA","91790","626-960-6738","626-960-1503","lashawn#hasty.com","http://www.lashawnhasty.com"
"Marianne","Earman","Albers Technologies Corp","6220 S Orange Blossom Trl","Orlando","Orange","FL","32809","407-857-0431","407-857-2506","marianne#earman.com","http://www.marianneearman.com"
"Justina","Dragaj","Uchner, David D Esq","2552 Poplar Ave","Memphis","Shelby","TN","38112","901-327-5336","901-327-2911","justina#dragaj.com","http://www.justinadragaj.com"
"Mandy","Mcdonnell","Southern Vermont Surveys","343 Bush St Se","Salem","Marion","OR","97302","503-371-8219","503-371-1118","mandy#mcdonnell.com","http://www.mandymcdonnell.com"
"Conrad","Lanfear","Kahler, Karen T Esq","49 Roche Way","Youngstown","Mahoning","OH","44512","330-758-0314","330-758-3536","conrad#lanfear.com","http://www.conradlanfear.com"
"Cyril","Behen","National Paper & Envelope Corp","1650 S Harbor Blvd","Anaheim","Orange","CA","92802","714-772-5050","714-772-3859","cyril#behen.com","http://www.cyrilbehen.com"
"Shelley","Groden","Norton, Robert L Esq","110 Broadway St","San Antonio","Bexar","TX","78205","210-229-3017","210-229-9757","shelley#groden.com","http://www.shelleygroden.com"
Our teacher wants us to create sed commands that automatically indent the data, wrap each line in TR tags, and wrap each field in TD tags, like this:
<HTML>
<HEAD><Title>Lab 4b by Andrey</Title></HEAD>
<BODY>
<table border="1">
<TR><TD>FirstName</TD><TD>LastName</TD><TD>Company</TD><TD>Address</TD><TD>City</TD><TD>County</TD><TD>State</TD><TD>ZIP</TD><TD>Phone</TD><TD>Fax</TD><TD>Email</TD><TD>Web</TD></TR>
<TR><TD>Essie</TD><TD>Vaill</TD><TD>Litronic Industries</TD><TD>14225 Hancock Dr</TD><TD>Anchorage</TD><TD>Anchorage</TD><TD>AK</TD><TD>99515</TD><TD>907-345-0962</TD><TD>907-345-1215</TD><TD>essie#vaill.com</TD><TD>http://www.essievaill.com</TD><TR>
<TR><TD>Cruz</TD><TD>Roudabush</TD><TD>Meridian Products</TD><TD>2202 S Central Ave</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85004</TD><TD>602-252-4827</TD><TD>602-252-4009</TD><TD>cruz#roudabush.com</TD><TD>http://www.cruzroudabush.com</TD><TR>
<TR><TD>Billie</TD><TD>Tinnes</TD><TD>D & M Plywood Inc</TD><TD>28 W 27th St</TD><TD>New York</TD><TD>New York</TD><TD>NY</TD><TD>10001</TD><TD>212-889-5775</TD><TD>212-889-5764</TD><TD>billie#tinnes.com</TD><TD>http://www.billietinnes.com</TD><TR>
<TR><TD>Zackary</TD><TD>Mockus</TD><TD>Metropolitan Elevator Co</TD><TD>286 State St</TD><TD>Perth Amboy</TD><TD>Middlesex</TD><TD>NJ</TD><TD>08861</TD><TD>732-442-0638</TD><TD>732-442-5218</TD><TD>zackary#mockus.com</TD><TD>http://www.zackarymockus.com</TD><TR>
<TR><TD>Rosemarie</TD><TD>Fifield</TD><TD>Technology Services</TD><TD>3131 N Nimitz Hwy #-105</TD><TD>Honolulu</TD><TD>Honolulu</TD><TD>HI</TD><TD>96819</TD><TD>808-836-8966</TD><TD>808-836-6008</TD><TD>rosemarie#fifield.com</TD><TD>http://www.rosemariefifield.com<$
<TR><TD>Bernard</TD><TD>Laboy</TD><TD>Century 21 Keewaydin Prop</TD><TD>22661 S Frontage Rd</TD><TD>Channahon</TD><TD>Will</TD><TD>IL</TD><TD>60410</TD><TD>815-467-0487</TD><TD>815-467-1244</TD><TD>bernard#laboy.com</TD><TD>http://www.bernardlaboy.com</TD><TR>
<TR><TD>Sue</TD><TD>Haakinson</TD><TD>Kim Peacock Beringhause</TD><TD>9617 N Metro Pky W</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85051</TD><TD>602-953-2753</TD><TD>602-953-0355</TD><TD>sue#haakinson.com</TD><TD>http://www.suehaakinson.com</TD><TR>
<TR><TD>Valerie</TD><TD>Pou</TD><TD>Sea Port Record One Stop Inc</TD><TD>7475 Hamilton Blvd</TD><TD>Trexlertown</TD><TD>Lehigh</TD><TD>PA</TD><TD>18087</TD><TD>610-395-8743</TD><TD>610-395-6995</TD><TD>valerie#pou.com</TD><TD>http://www.valeriepou.com</TD><TR>
<TR><TD>Lashawn</TD><TD>Hasty</TD><TD>Kpff Consulting Engineers</TD><TD>815 S Glendora Ave</TD><TD>West Covina</TD><TD>Los Angeles</TD><TD>CA</TD><TD>91790</TD><TD>626-960-6738</TD><TD>626-960-1503</TD><TD>lashawn#hasty.com</TD><TD>http://www.lashawnhasty.com</TD><T$
<TR><TD>Marianne</TD><TD>Earman</TD><TD>Albers Technologies Corp</TD><TD>6220 S Orange Blossom Trl</TD><TD>Orlando</TD><TD>Orange</TD><TD>FL</TD><TD>32809</TD><TD>407-857-0431</TD><TD>407-857-2506</TD><TD>marianne#earman.com</TD><TD>http://www.marianneearman.com</TD$
<TR><TD>Justina</TD><TD>Dragaj</TD><TD>Uchner David D Esq</TD><TD>2552 Poplar Ave</TD><TD>Memphis</TD><TD>Shelby</TD><TD>TN</TD><TD>38112</TD><TD>901-327-5336</TD><TD>901-327-2911</TD><TD>justina#dragaj.com</TD><TD>http://www.justinadragaj.com</TD><TR>
<TR><TD>Mandy</TD><TD>Mcdonnell</TD><TD>Southern Vermont Surveys</TD><TD>343 Bush St Se</TD><TD>Salem</TD><TD>Marion</TD><TD>OR</TD><TD>97302</TD><TD>503-371-8219</TD><TD>503-371-1118</TD><TD>mandy#mcdonnell.com</TD><TD>http://www.mandymcdonnell.com</TD><TR>
<TR><TD>Conrad</TD><TD>Lanfear</TD><TD>Kahler Karen T Esq</TD><TD>49 Roche Way</TD><TD>Youngstown</TD><TD>Mahoning</TD><TD>OH</TD><TD>44512</TD><TD>330-758-0314</TD><TD>330-758-3536</TD><TD>conrad#lanfear.com</TD><TD>http://www.conradlanfear.com</TD><TR>
<TR><TD>Cyril</TD><TD>Behen</TD><TD>National Paper & Envelope Corp</TD><TD>1650 S Harbor Blvd</TD><TD>Anaheim</TD><TD>Orange</TD><TD>CA</TD><TD>92802</TD><TD>714-772-5050</TD><TD>714-772-3859</TD><TD>cyril#behen.com</TD><TD>http://www.cyrilbehen.com</TD><TR>
<TR><TD>Shelley</TD><TD>Groden</TD><TD>Norton Robert L Esq</TD><TD>110 Broadway St</TD><TD>San Antonio</TD><TD>Bexar</TD><TD>TX</TD><TD>78205</TD><TD>210-229-3017</TD><TD>210-229-9757</TD><TD>shelley#groden.com</TD><TD>http://www.shelleygroden.com</TD><TR>
</table>
</BODY>
</HTML>
So, I was messing around and tried to create a few sed commands that would mimic the second output.
My first attempt was:
#!/bin/sh
sed -e 's=^.*$=<TR><TD>&</TD></TR>=' input.csv
Unfortunately, this only wraps the whole line: I get TR and TD tags at the beginning and end, but no TD tags between the fields:
<TR><TD>"Bryan","Rovell","All N All Shop","90 Hackensack St","East Rutherford","Bergen","NJ","07073","201-939-2788","201-939-9079","bryan#rovell.com","http://www.bryanrovell.com"</TD></TR>
<TR><TD>"Joey","Bolick","Utility Trailer Sales","7700 N Council Rd","Oklahoma City","Oklahoma","OK","73132","405-728-5972","405-728-5244","joey#bolick.com","http://www.joeybolick.com"</TD></TR>
I've also attempted to create individual sed commands to tag each field, but I've only managed to tag each word, so I'm kind of stuck.
I think I'm partially on the right track, but I need help indenting and adding TD to the beginning and end of every field, along with TR to the beginning and end of each line.
This is the main part of it:
$ sed -r 's:^"?: <TR><TD>:; s:"?,"?:</TD><TD>:g; s:"?$:</TD></TR>:' file
<TR><TD>FirstName</TD><TD>LastName</TD><TD>Company</TD><TD>Address</TD><TD>City</TD><TD>County</TD><TD>State</TD><TD>ZIP</TD><TD>Phone</TD><TD>Fax</TD><TD>Email</TD><TD>Web</TD></TR>
<TR><TD>Essie</TD><TD>Vaill</TD><TD>Litronic Industries</TD><TD>14225 Hancock Dr</TD><TD>Anchorage</TD><TD>Anchorage</TD><TD>AK</TD><TD>99515</TD><TD>907-345-0962</TD><TD>907-345-1215</TD><TD>essie#vaill.com</TD><TD>http://www.essievaill.com</TD></TR>
<TR><TD>Cruz</TD><TD>Roudabush</TD><TD>Meridian Products</TD><TD>2202 S Central Ave</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85004</TD><TD>602-252-4827</TD><TD>602-252-4009</TD><TD>cruz#roudabush.com</TD><TD>http://www.cruzroudabush.com</TD></TR>
<TR><TD>Billie</TD><TD>Tinnes</TD><TD>D & M Plywood Inc</TD><TD>28 W 27th St</TD><TD>New York</TD><TD>New York</TD><TD>NY</TD><TD>10001</TD><TD>212-889-5775</TD><TD>212-889-5764</TD><TD>billie#tinnes.com</TD><TD>http://www.billietinnes.com</TD></TR>
<TR><TD>Zackary</TD><TD>Mockus</TD><TD>Metropolitan Elevator Co</TD><TD>286 State St</TD><TD>Perth Amboy</TD><TD>Middlesex</TD><TD>NJ</TD><TD>08861</TD><TD>732-442-0638</TD><TD>732-442-5218</TD><TD>zackary#mockus.com</TD><TD>http://www.zackarymockus.com</TD></TR>
<TR><TD>Rosemarie</TD><TD>Fifield</TD><TD>Technology Services</TD><TD>3131 N Nimitz Hwy #-105</TD><TD>Honolulu</TD><TD>Honolulu</TD><TD>HI</TD><TD>96819</TD><TD>808-836-8966</TD><TD>808-836-6008</TD><TD>rosemarie#fifield.com</TD><TD>http://www.rosemariefifield.com</TD></TR>
<TR><TD>Bernard</TD><TD>Laboy</TD><TD>Century 21 Keewaydin Prop</TD><TD>22661 S Frontage Rd</TD><TD>Channahon</TD><TD>Will</TD><TD>IL</TD><TD>60410</TD><TD>815-467-0487</TD><TD>815-467-1244</TD><TD>bernard#laboy.com</TD><TD>http://www.bernardlaboy.com</TD></TR>
<TR><TD>Sue</TD><TD>Haakinson</TD><TD>Kim Peacock Beringhause</TD><TD>9617 N Metro Pky W</TD><TD>Phoenix</TD><TD>Maricopa</TD><TD>AZ</TD><TD>85051</TD><TD>602-953-2753</TD><TD>602-953-0355</TD><TD>sue#haakinson.com</TD><TD>http://www.suehaakinson.com</TD></TR>
<TR><TD>Valerie</TD><TD>Pou</TD><TD>Sea Port Record One Stop Inc</TD><TD>7475 Hamilton Blvd</TD><TD>Trexlertown</TD><TD>Lehigh</TD><TD>PA</TD><TD>18087</TD><TD>610-395-8743</TD><TD>610-395-6995</TD><TD>valerie#pou.com</TD><TD>http://www.valeriepou.com</TD></TR>
<TR><TD>Lashawn</TD><TD>Hasty</TD><TD>Kpff Consulting Engineers</TD><TD>815 S Glendora Ave</TD><TD>West Covina</TD><TD>Los Angeles</TD><TD>CA</TD><TD>91790</TD><TD>626-960-6738</TD><TD>626-960-1503</TD><TD>lashawn#hasty.com</TD><TD>http://www.lashawnhasty.com</TD></TR>
<TR><TD>Marianne</TD><TD>Earman</TD><TD>Albers Technologies Corp</TD><TD>6220 S Orange Blossom Trl</TD><TD>Orlando</TD><TD>Orange</TD><TD>FL</TD><TD>32809</TD><TD>407-857-0431</TD><TD>407-857-2506</TD><TD>marianne#earman.com</TD><TD>http://www.marianneearman.com</TD></TR>
<TR><TD>Justina</TD><TD>Dragaj</TD><TD>Uchner</TD><TD> David D Esq</TD><TD>2552 Poplar Ave</TD><TD>Memphis</TD><TD>Shelby</TD><TD>TN</TD><TD>38112</TD><TD>901-327-5336</TD><TD>901-327-2911</TD><TD>justina#dragaj.com</TD><TD>http://www.justinadragaj.com</TD></TR>
<TR><TD>Mandy</TD><TD>Mcdonnell</TD><TD>Southern Vermont Surveys</TD><TD>343 Bush St Se</TD><TD>Salem</TD><TD>Marion</TD><TD>OR</TD><TD>97302</TD><TD>503-371-8219</TD><TD>503-371-1118</TD><TD>mandy#mcdonnell.com</TD><TD>http://www.mandymcdonnell.com</TD></TR>
<TR><TD>Conrad</TD><TD>Lanfear</TD><TD>Kahler</TD><TD> Karen T Esq</TD><TD>49 Roche Way</TD><TD>Youngstown</TD><TD>Mahoning</TD><TD>OH</TD><TD>44512</TD><TD>330-758-0314</TD><TD>330-758-3536</TD><TD>conrad#lanfear.com</TD><TD>http://www.conradlanfear.com</TD></TR>
<TR><TD>Cyril</TD><TD>Behen</TD><TD>National Paper & Envelope Corp</TD><TD>1650 S Harbor Blvd</TD><TD>Anaheim</TD><TD>Orange</TD><TD>CA</TD><TD>92802</TD><TD>714-772-5050</TD><TD>714-772-3859</TD><TD>cyril#behen.com</TD><TD>http://www.cyrilbehen.com</TD></TR>
<TR><TD>Shelley</TD><TD>Groden</TD><TD>Norton</TD><TD> Robert L Esq</TD><TD>110 Broadway St</TD><TD>San Antonio</TD><TD>Bexar</TD><TD>TX</TD><TD>78205</TD><TD>210-229-3017</TD><TD>210-229-9757</TD><TD>shelley#groden.com</TD><TD>http://www.shelleygroden.com</TD></TR>
I expect you can figure out the rest since that's just printing the head and tail lines.
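If it helps, here is one way the whole thing could be put together. This is only a sketch: csv_to_html is a made-up function name, the four-space indent is a guess at what "indent" means, and the head and tail lines are copied from the expected output in the question. It assumes the header line is unquoted and every data line is fully quoted, as in the sample. It also splits quoted lines only on "," rather than on any comma, so a company such as Uchner, David D Esq stays in one cell instead of being split in two as in the output above.

```shell
#!/bin/sh
# Sketch: print the HTML head, convert each CSV line to a table row, print the tail.
# csv_to_html is a hypothetical helper; it takes the CSV file name as its argument.
csv_to_html() {
    # Head lines, copied from the expected output in the question.
    printf '%s\n' '<HTML>' '<HEAD><Title>Lab 4b by Andrey</Title></HEAD>' '<BODY>' '<table border="1">'

    # Lines that do not start with a quote (the header) are split on every comma.
    # Quoted lines are split only on "," so commas inside a field are left alone.
    sed '
/^"/!{
s:,:</TD><TD>:g
s:^:    <TR><TD>:
s:$:</TD></TR>:
b
}
s:",":</TD><TD>:g
s:^":    <TR><TD>:
s:"$:</TD></TR>:
' "$1"

    # Tail lines.
    printf '%s\n' '</table>' '</BODY>' '</HTML>'
}
```

You would call it as csv_to_html input.csv > table.html. Note that it does not escape characters such as & in the data, and it would need more work for fields that contain escaped quotes or for files with Windows line endings.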