NOAA data and writing a csv file - csv

I am trying to create a csv file from NOAA data, taken from their observation history page at http://www.srh.noaa.gov/data/obhistory/PAFA.html.
At the moment, I am having problems writing the csv file.
import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv

url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue
    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]
    print date, time, wind
    with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
        writer = csv.writer(f)
        print date, time, wind
        writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
        writer.writerow(str(time)+str(wind)+str(date)+'\n')
    if row_count == 74:
        print "74"
The printed result is fine; it is the file that is not. I get:
Title 1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"
The problems in the csv file created are:
1. The title is broken into the wrong columns: column 2 has "1,Title" instead of "Title 2".
2. The data is comma-delimited in the wrong places.
3. As the script writes new rows, it overwrites the previous one instead of appending at the bottom.
Any thoughts?

As far as overwriting rows goes, try opening the file with the 'a' (append) option rather than 'wb'. As far as fixing the comma delimiting goes, wrap the row in square brackets, i.e. pass it as a list, so the writer treats each element as one field. Take a look at the two examples here to see the difference:
import csv
text = 'This is a string'
with open('test.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(text)
This creates a csv whose first row is each character of text separated by a comma, because csv.writer iterates over the string as a sequence of one-character fields. Alternatively,
import csv
text = 'This is a string'
with open('test.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow([text])
This creates a csv file whose first row contains a single field holding text, with no commas separating the characters.
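Applied to the question's script, both fixes together mean opening the file once, before the loop, and writing each row as a list. A minimal sketch, keeping the question's Python 2 style and assuming table_rows from the scraping code above:
import csv

# Open the file once so rows accumulate, instead of being clobbered
# by reopening with 'wb' on every iteration.
with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(('Title 1', 'Title 2', 'Title 3'))  # header, written once
    for table_row in table_rows[3:]:  # skip the three header rows
        date = table_row('td')[0].contents[0]
        time = table_row('td')[1].contents[0]
        wind = table_row('td')[2].contents[0]
        writer.writerow([date, time, wind])  # one list element per field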

Related

Azure Bicep - load excel file in Bicep

I would like to load values from an Excel file; it contains only names, and I have a lot of them. I don't want to copy all of them by hand into an array. I am hoping for a solution along the lines of [loadJsonContent].
If you are asking for a built-in file function in Bicep, the answer is no.
From the official documentation, Bicep has only three file functions:
loadFileAsBase64 - loads the file as a base64 string.
loadJsonContent - loads the specified JSON file as an Any object.
loadTextContent - loads the content of the specified file as a string.
I think your requirement has to be achieved by writing code.
By the way, you didn't say which Excel format you mean: xlsx or csv? If possible, please provide a sample file so that we can provide specific code.
For example, say I have a Student.xlsx file like this (a CSV file would have the same structure):
Then I can use this Python code to parse and get the data I want:
import openpyxl
import csv

# Get the student names from the 'Student Name' column of sheet 'Tabelle1'
# of the file 'XLSX_Folder/Student.xlsx'.
def get_student_name(file_path, sheet_name, col):
    student_name = []
    if file_path.endswith('.xlsx'):
        wb = openpyxl.load_workbook(file_path)
        sheet = wb[sheet_name]
        # get all the values in the column, skipping the header row
        for i in range(2, sheet.max_row + 1):
            student_name.append(sheet.cell(row=i, column=col).value)
        print('This is xlsx file.')
        return student_name
    elif file_path.endswith('.csv'):
        # get all the values in the column, except the first (header) row
        with open(file_path, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if reader.line_num == 1:
                    continue
                student_name.append(row[col - 1])
        print('This is csv file.')
        return student_name
    else:
        print('This is other format file.')

XLSX_file_path = 'XLSX_Folder/Student.xlsx'
CSV_file_path = 'CSV_Folder/Student.csv'
sheet_name = 'Tabelle1'
col = 2

print(get_student_name(XLSX_file_path, sheet_name, col))
print(get_student_name(CSV_file_path, sheet_name, col))
Result:
After that, parse your Bicep file and insert the extracted data into it.
The above code is just a demo; you can write your own in whatever language you like. Either way, there is no built-in feature for your requirement.
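One way to hand the extracted names back to Bicep, sketched here with hypothetical file names: dump them to a JSON file, which a template can then read with loadJsonContent:
import json

# Write the names where the Bicep template can load them with
# loadJsonContent('student_names.json'); both paths are assumptions.
names = get_student_name('CSV_Folder/Student.csv', 'Tabelle1', 2)
with open('student_names.json', 'w') as f:
    json.dump(names, f)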

Extracting csv file rows as individual .txt files

I am new to Python and trying to extract certain data from rows of a csv file into individual .txt files (to create a corpus for NLP). So far I have the following:
import csv

with open(r"file.csv", "r+", encoding='utf-8') as f:
    reader = csv.reader(f)
    data = list(reader)
    t = (data[1][91])
    fn = str(data[1][90])
    g = open("%s.txt" % fn, "w+")
    for i in range(1):
        g.write(t)
    g.close
This does what I want for the first row; however, I am not sure how to get the program to loop up to row 1047. Note: the [1] signifies row 1; the [91] and [90] should remain fixed.
Thanks in advance!
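A minimal sketch of that loop (assuming row 0 is a header, rows 1 through 1047 hold the data, and columns 90 and 91 stay fixed as in the question):
import csv

with open(r"file.csv", "r", encoding='utf-8') as f:
    data = list(csv.reader(f))

# One .txt file per row: column 90 supplies the file name, column 91 the text.
for row in data[1:1048]:
    with open("%s.txt" % row[90], "w", encoding='utf-8') as g:
        g.write(row[91])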

How to add/change column names with pyarrow.read_csv?

I am currently trying to import a big csv file (50GB+) without any headers into a pyarrow table, with the overall target of exporting it to the Parquet format and then processing it in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes within pyarrow for the csv file?
I already thought about appending the header to the csv file, but that forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example of doing so while transforming a csv file to an arrow table.
Imagine that this csv file just has for an easy example the two columns "A" and "B".
My current code looks like this:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv  # needed so that pa.csv is available

df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)

df_with_header.to_csv("data.csv", header=False, index=False)
df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)

opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
                                           'B': 'int8'})
table = pa.csv.read_csv(input_file="data.csv", convert_options=opts)
print(table)
If I print out the final table, the column names have not changed:
pyarrow.Table
1: int64
3: int64
How can I change the loaded column names and dtypes? Is there maybe also a possibility to, for example, pass in a dict containing the names and their dtypes?
You can specify type overrides for columns:
import io
import pyarrow as pa
from pyarrow import csv

fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    convert_options=csv.ConvertOptions(
        column_types={
            'one': pa.int8(),
            'two': pa.int8(),
            'three': pa.int8(),
        }
    ))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to this question: apache arrow - reading csv file
There should be fix for it in the next version: https://github.com/apache/arrow/pull/4898
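For reference, once that fix shipped, newer pyarrow versions let you supply the missing header through ReadOptions. A sketch, assuming a current pyarrow release:
import io
import pyarrow as pa
from pyarrow import csv

fp = io.BytesIO(b'1,2,3\n4,5,6')

# column_names supplies the header the file lacks; column_types then
# keys off those names, which covers both halves of the question.
table = csv.read_csv(
    fp,
    read_options=csv.ReadOptions(column_names=['A', 'B', 'C']),
    convert_options=csv.ConvertOptions(column_types={'A': pa.int8(),
                                                     'B': pa.int8(),
                                                     'C': pa.int8()}),
)
print(table.schema)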

How to control quoting on non-numerical entries in a csv file?

I am using Python3's csv module and am wondering why I cannot control quoting correctly. I am using the option quoting = csv.QUOTE_NONNUMERIC but am still seeing all entries quoted. Any idea as to why that is?
Here's my code. Essentially, I am reading in a csv file and want to remove all duplicate lines that have the same text string:
import sys
import csv

class Row:
    def __init__(self, row):
        self.text, self.a, self.b = row
        self.elements = row

with open(sys.argv[2], 'w', newline='') as output:
    writer = csv.writer(output, delimiter=';', quotechar='"',
                        quoting=csv.QUOTE_NONNUMERIC)
    with open(sys.argv[1]) as input:
        reader = csv.reader(input, delimiter=';')
        header = next(reader)
        Row.labels = header
        assert Row.labels[1] == 'Label1'
        writer.writerow(header)
        texts = set()
        for row in reader:
            row_object = Row(row)
            if row_object.text not in texts:
                writer.writerow(row_object.elements)
                texts.add(row_object.text)
When I look at the generated file, the content looks like this:
"Label1";"Label2";"Label3"
"AAA";"123";"456"
...
But I want this:
"Label1";"Label2";"Label3"
"AAA";123;456
...
OK ... I figured it out myself. The answer, I am afraid, was rather simple, and obvious in retrospect. Since the content of each line is obtained from a csv.reader(), its elements are strings by default. As a result, they get quoted by the subsequently employed csv.writer().
To be treated as an int, they first need to be cast to an int:
row_object.elements[1]= int(row_object.a)
This explanation can be proven by inserting a type check before and after this cast:
print('Type: {}'.format(type(row_object.elements[1])))
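In the loop above, the casts could be applied just before writing; a sketch, assuming columns 1 and 2 are the numeric ones, as in the sample output:
for row in reader:
    row_object = Row(row)
    if row_object.text not in texts:
        # Cast the numeric columns so QUOTE_NONNUMERIC leaves them unquoted.
        row_object.elements[1] = int(row_object.a)
        row_object.elements[2] = int(row_object.b)
        writer.writerow(row_object.elements)
        texts.add(row_object.text)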

Reading CSV file and generating Dictionaries

I have a CSV file that looks like
Hit39, Hit24, Hit9
Hit8, Hit39, Hit21
Hit46, Hit47, Hit20
Hit24, Hit 53, Hit46
I want to read the file and create a dictionary on a first come, first served basis,
like Hit39: 1, Hit24: 2, and so on.
But notice that Hit39 also appears in row 2, column 2. If the reader has already seen a value, it should not add it to the dictionary again; it should move on to the next new value.
Once a value has been visited, later occurrences of it should be ignored.
Using Python. Best guess until the OP clarifies: treat the file as though it were one huge list and assign an incrementing counter to each unique occurrence of a value.
import csv
from itertools import count

mydict = {}
counter = count(1)
with open('infile.csv') as fin:
    for row in csv.reader(fin, skipinitialspace=True):
        for col in row:
            # Only draw a new number for unseen values; using
            # mydict.get(col, next(counter)) would burn a count
            # even when the key already exists.
            if col not in mydict:
                mydict[col] = next(counter)
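With the sample data above (reading "Hit 53" as "Hit53"), mydict comes out as:
{'Hit39': 1, 'Hit24': 2, 'Hit9': 3, 'Hit8': 4, 'Hit21': 5,
 'Hit46': 6, 'Hit47': 7, 'Hit20': 8, 'Hit53': 9}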
Since Python is a popular language that has dictionaries, you must be using Python. At least I assume.
import csv
reader = csv.reader(file("filename.csv"))
d = dict((line[0], 1+lineno) for lineno, line in enumerate(reader))
print d