Extract tabular data from HTML and save as a text file

I want to extract the tabular data from an HTML page and save it as a text file:
import urllib2, numpy as np, pandas as pd

fo = 'fo.txt'
url = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/'
html = urllib2.urlopen(url).read()
rows = pd.read_html(html)
print type(rows)
print rows
for row in rows:
    this_row = "|".join([str(td) for td in row])
    fo.write(this_row + "\n")
But I got this error:
Traceback (most recent call last):
fo.write(this_row + "\n")
AttributeError: 'str' object has no attribute 'write'
The resulting tabular data in the text file should look like the table at the original link:
https://coinmarketcap.com/currencies/bitcoin/historical-data/
Any help, please!

If you want to write to a text file you need a file object; in your source code, fo is just a string holding the file name.
In Python you can open a file for writing like this:
with open(fo, 'w') as text_file:
    for row in rows:
        this_row = "|".join([str(td) for td in row])
        text_file.write(this_row + "\n")

Azure Bicep - load excel file in Bicep

I would like to load values from an Excel file; it contains only names, and there are a lot of them, so I don't want to copy them all into an array by hand. I am hoping for something like [loadJsonContent], if that is possible.
If you are asking for a built-in file function in Bicep, the answer is no.
According to the official documentation, Bicep has only three file functions:
loadFileAsBase64 - loads the file as a base64 string.
loadJsonContent - loads the specified JSON file as an Any object.
loadTextContent - loads the content of the specified file as a string.
I think your requirement has to be met by writing your own code.
By the way, you didn't say exactly what kind of Excel file you have: xlsx or csv? If possible, please provide a sample file so that we can give specific code.
For example, say I have a Student.xlsx file with a 'Student Name' column in sheet 'Tabelle1' (a CSV file with the same structure also works).
Then I can use this Python code to parse it and get the data I want:
import openpyxl
import csv

# Get the student names from the 'Student Name' column of sheet 'Tabelle1'
# of 'XLSX_Folder/Student.xlsx' (or from the same column of a CSV file).
def get_student_name(file_path, sheet_name, col):
    student_name = []
    if file_path.endswith('.xlsx'):
        wb = openpyxl.load_workbook(file_path)
        sheet = wb[sheet_name]
        # Collect every value in the column, skipping the header row.
        for i in range(2, sheet.max_row + 1):
            student_name.append(sheet.cell(row=i, column=col).value)
        print('This is an xlsx file.')
        return student_name
    elif file_path.endswith('.csv'):
        # Collect every value in the column, skipping the header row.
        with open(file_path, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if reader.line_num == 1:
                    continue
                student_name.append(row[col - 1])
        print('This is a csv file.')
        return student_name
    else:
        print('This is some other file format.')

XLSX_file_path = 'XLSX_Folder/Student.xlsx'
CSV_file_path = 'CSV_Folder/Student.csv'
sheet_name = 'Tabelle1'
col = 2

print(get_student_name(XLSX_file_path, sheet_name, col))
print(get_student_name(CSV_file_path, sheet_name, col))
Result: both calls print the list of student names.
After that, parse your Bicep file and put the extracted data into it.
The above code is just a demo; you can write your own code in whatever language you like. Either way, there is no built-in feature for your requirement.
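For example, one way to get the names into Bicep without a built-in function (a sketch, with hypothetical file names) is to dump them to a JSON file and read that with loadJsonContent:
import json

# Hypothetical glue step: write the extracted names where the Bicep file can load them.
names = get_student_name('XLSX_Folder/Student.xlsx', 'Tabelle1', 2)
with open('student-names.json', 'w') as f:
    json.dump(names, f)
In the Bicep file, var studentNames = loadJsonContent('student-names.json') would then give you the array without copying the names by hand.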

How to extract HTML links with a matching word

I am trying to make a crawler that reads a text file of URLs, assigns them to variables, and builds a list of pages that can be parsed to search for URLs containing the word "wp-". Unfortunately, I am stuck at the part where I need to scrape each page to see if any of its URLs contain "wp-". I've tried a number of ways, including various variants of
//a[contains(#href, 'wp-')]
but it does not work. Any suggestions on how to get the parsing for wp- working?
Here is my code so far
'''
#!/usr/bin/python
import urllib.request
import urlopen

# import urls into a readable python file
f = open("url-list.txt", "r")
text = f.read()
# turn the urls in the file into a list by splitting it into lines
text_list = text.splitlines()
f.close()
#print(text_list)  # don't need to show the links as a list

# make the list into variables
count = 0
for breakaway in text_list:  # iterate over the list
    count = count + 1
    print(count, " Sending url-list to scraper...")
    for url in //a[contains(#href, 'wp-')].extract():
        print(url)
'''
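For what it's worth, //a[contains(#href, 'wp-')] is XPath rather than Python, and the attribute axis is written @href, not #href; it also has to be run through an HTML parser. A minimal sketch of the scraping loop using lxml (an assumption, since the question does not name a parsing library):
import urllib.request
from lxml import html

with open("url-list.txt") as f:
    urls = f.read().splitlines()

for count, url in enumerate(urls, start=1):
    print(count, "Sending url to scraper...")
    page = urllib.request.urlopen(url).read()
    tree = html.fromstring(page)
    # XPath: every href attribute containing 'wp-' (note @href, not #href)
    for link in tree.xpath("//a[contains(@href, 'wp-')]/@href"):
        print(link)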

Extracting csv file rows as individual .txt files

I am new to Python and trying to extract certain data from rows of a csv file into individual .txt files (to create a corpus for NLP). So far I have the following:
import csv

with open(r"file.csv", "r+", encoding='utf-8') as f:
    reader = csv.reader(f)
    data = list(reader)

t = data[1][91]
fn = str(data[1][90])
g = open("%s.txt" % fn, "w+")
g.write(t)
g.close()
This does what I want for the first row; however, I am not sure how to make the program loop up to row 1047. Note: the [1] indexes row 1, while the [91] and [90] column indices should remain fixed.
Thanks in advance!
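A minimal sketch of the loop (assuming rows 1 through 1047 exist, and that column 90 holds the file name and column 91 the text, as in the question):
import csv

with open(r"file.csv", "r", encoding="utf-8") as f:
    data = list(csv.reader(f))

# Rows 1..1047 inclusive; the column indices 90 and 91 stay fixed.
for row in data[1:1048]:
    fn = str(row[90])
    with open("%s.txt" % fn, "w", encoding="utf-8") as g:
        g.write(row[91])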

Python 3: replacing special characters in a .csv file after converting it from JSON

I am trying to develop a program using Python 3.6.4 that converts a JSON file into a CSV file; the data in the CSV file also needs to be cleaned. For example:
My JSON File:
{emp:[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},
{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}
Problem 1:
After converting it to CSV, a blank row appears between every two rows:
**Name email Des**
[<BLANK ROW>]
Bo#b bob#gmail.com Unknown
[<BLANK ROW>]
Martin mar#tin#gmail.com D#eveloper
Problem 2:
In my code I am using the key emp, but I need to determine it dynamically:
import json

fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read()
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
emp_data = employee_parsed['employee']
We will not know the structure or content of the incoming JSON file in advance.
Problem 3:
I also need to remove all # characters from the CSV file.
For Problem 3, you can use .replace (https://www.tutorialspoint.com/python/string_replace.htm).
For Problem 2, you can take the dictionary's keys and use the first one:
import json

fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read().replace("#", "")
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
# dict_keys is not subscriptable in Python 3, so take the first key via iter()
first_key = next(iter(employee_parsed))
emp_data = employee_parsed[first_key]
I can't solve Problem 1 without seeing more of the code that exports the result. It may be that your data has newlines in it, in which case you could add .replace("\n", "") and/or .replace("\r", "") after the previous replace, so the line would read fobj.read().replace("#", "").replace("\n", "").replace("\r", "").
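For what it's worth, blank rows between CSV records in Python 3 usually come from opening the output file without newline="". A sketch of the whole conversion under that assumption (the file names are hypothetical):
import csv
import json

with open("emp.json") as fobj:               # hypothetical input path
    jsonCont = fobj.read().replace("#", "")  # Problem 3: strip '#'

parsed = json.loads(jsonCont)
records = parsed[next(iter(parsed))]         # Problem 2: first, unknown top-level key

# Problem 1: newline="" stops the csv module from inserting blank rows on Windows.
with open("emp.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)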

NOAA data and writing a csv file

I am trying to create a csv file from the NOAA data at http://www.srh.noaa.gov/data/obhistory/PAFA.html.
At the moment, I am having problems writing the csv file.
import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv

url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)
table = soup('table')[3]
table_rows = table.findAll('tr')

row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue
    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]
    print date, time, wind
    with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
        writer = csv.writer(f)
        print date, time, wind
        writer.writerow(('Title 1', 'Title 2', 'Title 3'))
        writer.writerow(str(time) + str(wind) + str(date) + '\n')
    if row_count == 74:
        print "74"
The printed result is fine; it is the file that is not. I get:
Title 1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"
The problems in the created csv file are:
1. The title is broken into the wrong columns; column 2 has "1,Title" instead of "Title 2".
2. The data is comma-delimited in the wrong places.
3. As the script writes new lines, it overwrites the previous one instead of appending at the bottom.
Any thoughts?
As far as the overwriting goes, try opening the file with the 'a' (append) option rather than 'wb'. As for the comma delineation, wrap each row's text in a list so the writer treats it as a single field. Take a look at the difference between these two examples:
import csv

text = 'This is a string'
with open('test.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(text)
This creates a csv file whose first row is each character of text separated by a comma. Alternatively,
import csv

text = 'This is a string'
with open('test.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow([text])
This creates a csv file whose first row contains only one item, the whole of text, with no commas separating the characters.
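Putting both fixes together for the NOAA script, an alternative sketch that avoids reopening the file at all (keeping the question's Python 2 style; the path and header-row count come from the question): open the file once before the loop, write the header once, and pass each row as a list:
import csv

with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Time', 'Wind'])  # header, written once
    for table_row in table_rows[3:]:           # skip the three header rows
        cells = table_row('td')
        date = cells[0].contents[0]
        time = cells[1].contents[0]
        wind = cells[2].contents[0]
        writer.writerow([date, time, wind])    # one list -> one clean csv row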