Cannot load a CSV file in Colab using tf.compat.v1.keras.utils.get_file

I have mounted my GDrive and have a csv file in a folder. I am following the tutorial. However, when I call tf.keras.utils.get_file(), I get a ValueError, as follows.
data_folder = r"/content/drive/My Drive/NLP/project2/data"
import os
print(os.listdir(data_folder))
It returns:
['crowdsourced_labelled_dataset.csv',
'P2_Testing_Dataset.csv',
'P2_Training_Dataset_old.csv',
'P2_Training_Dataset.csv']
TRAIN_DATA_URL = os.path.join(data_folder, 'P2_Training_Dataset.csv')
train_file_path = tf.compat.v1.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
But this returns:
Downloading data from /content/drive/My Drive/NLP/project2/data/P2_Training_Dataset.csv
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-5bd642083471> in <module>()
2 TRAIN_DATA_URL = os.path.join(data_folder, 'P2_Training_Dataset.csv')
3 TEST_DATA_URL = os.path.join(data_folder, 'P2_Testing_Dataset.csv')
----> 4 train_file_path = tf.compat.v1.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
5 test_file_path = tf.compat.v1.keras.utils.get_file("eval.csv", TEST_DATA_URL)
6 frames
/usr/lib/python3.6/urllib/request.py in _parse(self)
382 self.type, rest = splittype(self._full_url)
383 if self.type is None:
--> 384 raise ValueError("unknown url type: %r" % self.full_url)
385 self.host, self.selector = splithost(rest)
386 if self.host:
ValueError: unknown url type: '/content/drive/My Drive/NLP/project2/data/P2_Training_Dataset.csv'
What am I doing wrong please?

As per the docs, this is the signature of tf.compat.v1.keras.utils.get_file:
tf.keras.utils.get_file(
fname,
origin,
untar=False,
md5_hash=None,
file_hash=None,
cache_subdir='datasets',
hash_algorithm='auto',
extract=False,
archive_format='auto',
cache_dir=None
)
By default the file at the url origin is downloaded to the cache_dir ~/.keras, placed in the cache_subdir datasets, and given the filename fname. The final location of a file example.txt would therefore be ~/.keras/datasets/example.txt.
Returns:
Path to the downloaded file
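In other words, origin is expected to be a downloadable URL with a scheme such as https://, which is exactly what the unknown url type error is complaining about. A hypothetical example of the intended usage (the URL below is a placeholder, not a real dataset):
import tensorflow as tf

# Hypothetical remote URL, shown only to illustrate the expected usage:
# get_file downloads it into ~/.keras/datasets/train.csv and returns that path.
TRAIN_DATA_URL = "https://example.com/data/train.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)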
Since you already have the data in your drive, there's no need to download it again (and IIUC, the function expects an accessible URL). Also, there's no need to obtain the file name from a function call because you already know it.
Assuming the drive is mounted, you can replace your file paths as below:
train_file_path = os.path.join(data_folder, 'P2_Training_Dataset.csv')
test_file_path = os.path.join(data_folder, 'P2_Testing_Dataset.csv')
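If you then want to continue with the tutorial, those paths can be handed straight to whatever reader you're using. A minimal sketch with pandas, purely as an illustration (the actual column layout of your CSVs isn't shown in the question, so nothing here depends on it):
import pandas as pd

# The files are already on the mounted drive, so they can be read directly;
# no download step is needed.
train_df = pd.read_csv(train_file_path)
test_df = pd.read_csv(test_file_path)
print(train_df.shape, test_df.shape)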

Related

how to import variables from a json file to attributes in BUILD.bazel?

I would like to import variables defined in a JSON file (my_info.json) as attributes for Bazel rules.
I tried this (https://docs.bazel.build/versions/5.3.1/skylark/tutorial-sharing-variables.html) and it works, but I do not want to use a .bzl file; I want to import the variables directly as attributes in BUILD.bazel.
I want to use those variables imported from my_info.json as attributes for other BUILD.bazel files.
projects/python_web/BUILD.bazel
load("//projects/tools/parser:config.bzl", "MY_REPO","MY_IMAGE")
container_push(
name = "publish",
format = "Docker",
registry = "registry.hub.docker.com",
repository = MY_REPO,
image = MY_IMAGE,
tag = "1",
)
Asking something similar in the Bazel Slack, I was told that it is not possible to import variables directly into Bazel and that it is necessary to parse the JSON and write the variables into a .bzl file.
I also tried the code below, but nothing is written to the config.bzl file.
my_info.json
{
"MYREPO" : "registry.hub.docker.com",
"MYIMAGE" : "michael/monorepo-python-web"
}
WORKSPACE.bazel
load("//projects/tools/parser:jsonparser.bzl", "load_my_json")
load_my_json(
name = "myjson"
)
projects/tools/parser/jsonparser.bzl
def _load_my_json_impl(repository_ctx):
json_data = json.decode(repository_ctx.read(repository_ctx.path(Label(":my_info.json"))))
config_lines = ["%s = %s" % (key, repr(val)) for key, val in json_data.items()]
repository_ctx.file("config.bzl", "\n".join(config_lines))
load_my_json = repository_rule(
implementation = _load_my_json_impl,
attrs = {},
)
projects/tools/parser/BUILD.bazel
load("#aspect_bazel_lib//lib:yq.bzl", "yq")
load(":config.bzl", "MYREPO", "MY_IMAGE")
yq(
name = "convert",
srcs = ["my_info2.json"],
args = ["-P"],
outs = ["bar.yaml"],
)
Executing:
% bazel build projects/tools/parser:convert
ERROR: Traceback (most recent call last):
File "/Users/michael.taquia/Documents/Personal/Projects/bazel/bazel-projects/multi-language-bazel-monorepo/projects/tools/parser/BUILD.bazel", line 2, column 22, in <toplevel>
load(":config.bzl", "MYREPO", "MY_IMAGE")
Error: file ':config.bzl' does not contain symbol 'MYREPO'
While troubleshooting I can see that the execution loads jsonparser.bzl but never enters the _load_my_json_impl function (based on print statements) and nothing is written to config.bzl.
Notes: Tested on macOS 12.6 (21G115), Darwin Kernel Version 21.6.0.
Is there a better way to do this? A code snippet would be very useful.

Loading Multiple CSV files across all subfolder levels with Wildcard file name

I want to load multiple CSV files matching certain names into a dataframe. Currently I am looping through the whole folder, creating a list of filenames, loading those CSVs into a list of dataframes, and then concatenating them.
The approach I want to use (if possible) is to bypass all that code and read all the files in a one-liner kind of approach.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
├── Subfolder1
│   └── Subfolder 2
│       ├── X01.csv
│       ├── Y01.csv
│       └── Z01.csv
├── Subfolder3
│   └── Subfolder4
│       ├── X01.csv
│       └── Y01.csv
└── Subfolder5
    ├── X01.csv
    └── Y01.csv
I want to read all "X01.csv" files while reading from Root Folder.
Is there a way I can read all the required files with code something like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFilelookup","true").option("header","true").load(filepath)
This code works fine for a single level of subfolders; is there any equivalent of this for multi-level folders? I thought the "recursiveFilelookup" option would look across all levels of subfolders, but apparently this is not the way it works.
Currently I am getting a
Path not found ... filepath
exception.
Any help please.
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to the spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment.
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv's ability to read a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
To return a list of all csv files inside any subdirectory of rootpath.
Additionally, to consider only the files sitting directly inside rootpath (no recursion), you could use something like:
glob.glob(rootpath + "*.csv")
Edit
Based on your comments to this answer, does something like this work on Databricks?
from notebookutils import mssparkutils as ms
from py4j.protocol import Py4JJavaError  # raised by ms.fs.ls when a path doesn't exist
# databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls
def scan_dir(
initial_path: str,
search_str: str,
account_name: str = "",  # only needed when initial_path is not a full abfss:// URL
):
"""Scan a directory and subdirectories for a string.
Parameters
----------
initial_path : str
The path to start the search. Accepts either a valid container name,
or the entire connection string.
search_str : str
The string to search.
account_name : str
The name of the account to access the container folders.
This value is only used, when the `initial_path`, doesn't
conform with the format: "abfss://<initial_path>@<account_name>.dfs.core.windows.net/"
Raises
------
FileNotFoundError
If the `initial_path` informed doesn't exist.
ValueError
If `initial_path` is not a string.
"""
if not isinstance(initial_path, str):
raise ValueError(
f'`initial_path` needs to be of type string, not {type(initial_path)}'
)
elif not initial_path.startswith('abfss'):
initial_path = f'abfss://{initial_path}@{account_name}.dfs.core.windows.net/'
try:
fdirs = ms.fs.ls(initial_path)
except Py4JJavaError as exc:
raise FileNotFoundError(
f'The path you informed \"{initial_path}\" doesn\'t exist'
) from exc
found = []
for path in fdirs:
p = path.path
if path.isDir:
found = [*found, *scan_dir(p, search_str)]
if search_str.lower() in path.name.lower():
# print(p.split('.net')[-1])
found = [*found, p.replace(path.name, "")]
return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME@ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
The method above worked for me on Azure Synapse.
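For the CSV files from the original question, the same call might look like this (the container, account, and folder names are placeholders):
# scan_dir returns the folders that contain files whose names include the
# search string; these are folder paths, so Spark reads every CSV in them.
paths = scan_dir(
    "abfss://CONTAINER_NAME@ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/",
    "X01.csv",
)
df = spark.read.csv(paths, header=True, inferSchema=True)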

standard input file is closed stack traceback error when looping a function to open a file in lua

1   local function readfile(name)
2     io.input(name)
3     local file = io.read("*all")
4     io.close(io.input())
5     return(file)
6   end

31  repeat
32    local input = io.read()
33    if string.sub(input, 1, 1) == ("#") then do
34      local ninput = string.sub(input, 2, 100)
35      print(readfile(ninput))
36    end
37    end
38  until input == ("stop")
I get this error every time I try to run the code. It runs once, then errors out afterward. Any help is much appreciated.
lua:32: standard input file is closed
stack traceback:
[C]: in function 'io.read'
testing.lua:32: in main chunk
[C]: in ?
Your readfile redirects the default input file with io.input(name) and then closes it, so the io.read() on line 32 ends up reading from a closed default input. Read files with io.open instead, so stdin is never touched:
local function readfile(name)
local f = assert(io.open(name))
local content = f:read("*all")
f:close()
return content
end

Connection issues in Storage trigger GCF

For my application, a new file uploaded to storage is read and its data is added to a main file. The new file contains 2 lines: a header and an array whose values are separated by commas. The main file will need a maximum of 265 MB. The new files will have a maximum of 30 MB.
def write_append_to_ecg_file(filename,ecg,patientdata):
file1 = open('/tmp/'+ filename,"w+")
file1.write(":".join(patientdata))
file1.write('\n')
file1.write(",".join(ecg.astype(str)))
file1.close()
def storage_trigger_function(data,context):
#Download the segment file
download_files_storage(bucket_name,new_file_name,storage_folder_name = blob_path)
#Read the segment file
data_from_new_file,meta = read_new_file(new_file_name, scale=1, fs=125, include_meta=True)
print("Length of ECG data from segment {} file {}".format(segment_no,len(data_from_new_file)))
os.remove(new_file_name)
#Check if the main ecg_file_exists
file_exists = blob_exists(bucket_name, blob_with_the_main_file)
print("File status {}".format(file_exists))
data_from_main_file = []
if file_exists:
download_files_storage(bucket_name,main_file_name,storage_folder_name = blob_with_the_main_file)
data_from_main_file,meta = read_new_file(main_file_name, scale=1, fs=125, include_meta=True)
print("ECG data from main file {}".format(len(data_from_main_file)))
os.remove(main_file_name)
data_from_main_file = np.append(data_from_main_file,data_from_new_file)
print("data after appending {}".format(len(data_from_main_file)))
write_append_to_ecg_file(main_file,data_from_main_file,meta)
token = upload_files_storage(bucket_name,main_file,storage_folder_name = main_file_blob,upload_file = True)
else:
write_append_to_ecg_file(main_file,data_from_new_file,meta)
token = upload_files_storage(bucket_name,main_file,storage_folder_name = main_file_blob,upload_file = True)
The GCF is deployed
gcloud functions deploy storage_trigger_function --runtime python37 --trigger-resource patch-us.appspot.com --trigger-event google.storage.object.finalize --timeout 540s --memory 8192MB
For the first file, I was able to read the file and write the data to the main file. But after uploading the 2nd file, it gives Function execution took 70448 ms, finished with status: 'connection error'. On uploading the 3rd file, it gives Function invocation was interrupted. Error: memory limit exceeded. Despite deploying the function with 8192MB of memory, I am getting this error. Can I get some help on this?

Splitting a csv file into multiple files

I have a csv file of 150500 rows and I want to split it into multiple files containing 500 rows (entries) each.
I'm using Jupyter and I know how to open and read the file. However, I don't know how to specify an output_path to record the newly created files from splitting the big one.
I have found this code online, but once again, since I don't know what my output_path is, I don't know how to use it. Moreover, for this block of code I don't understand how we specify the input file.
import os
def split(filehandler, delimiter=',', row_limit=1000,
output_name_template='output_%s.csv', output_path='.', keep_headers=True):
import csv
reader = csv.reader(filehandler, delimiter=delimiter)
current_piece = 1
current_out_path = os.path.join(
output_path,
output_name_template % current_piece
)
current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
current_limit = row_limit
if keep_headers:
headers = next(reader)  # reader.next() works only in Python 2; use next() in Python 3
current_out_writer.writerow(headers)
for i, row in enumerate(reader):
if i + 1 > current_limit:
current_piece += 1
current_limit = row_limit * current_piece
current_out_path = os.path.join(
output_path,
output_name_template % current_piece
)
current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
if keep_headers:
current_out_writer.writerow(headers)
current_out_writer.writerow(row)
My file name is DataSet2.csv and it's in the same folder in Jupyter as the ipynb notebook I'm running.
number_of_small_files = 301
lines_per_small_file = 500
largeFile = open('large.csv', 'r')
header = largeFile.readline()
for i in range(number_of_small_files):
smallFile = open(str(i) + '_small.csv', 'w')
smallFile.write(header) # This line copies the header to all small files
for x in range(lines_per_small_file):
line = largeFile.readline()
smallFile.write(line)
smallFile.close()
largeFile.close()
This will create many small files in the same directory. About 301 of them. They will be named from 0_small.csv to 300_small.csv.
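Regarding the function you found online: the input file is simply the open file object you pass in as filehandler, and output_path is the folder where the pieces get written ('.' means the notebook's own folder). A minimal sketch of a call, assuming your DataSet2.csv and 500-row pieces:
# Pass an open file handle as filehandler; the pieces are written to
# output_path using the output_name_template pattern (output_1.csv, ...).
with open('DataSet2.csv', 'r') as f:
    split(f, row_limit=500, output_path='.')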
Using standard unix utilities:
cat DataSet2.csv | tail -n +2 | split -l 500 --additional-suffix=.csv - output_
This pipeline takes the original file, strips off the first line with 'tail -n +2', and then splits the rest into 500-line chunks that are put into files with names that start with 'output_' and end with '.csv'. The '-' tells split to read from standard input, with 'output_' used as the filename prefix; note that the header row is not copied into the chunks.