Often I find myself doing repetitive find & replace operations in a file. Most often these come down to fixed operations: deleting certain lines, changing strings that are always the same, and so on.
In Vim that is a no-brainer; the function below is copied verbatim:
function! Modify_Strength_Files()
    execute ':%s/?/-/'
    execute ':%s/Ä/-/'
    "--------------------------------------------------------
    execute ':%s/Ä/-/'
    execute ':%s///g'
    "--------------------------------------------------------
    execute ':g/Version\ of\ Light\ Ship/d'
    execute ':g/Version\ of\ Data\ for\ Specific\ Regulations/d'
    "--------------------------------------------------------
    " execute ':g/LOADING\ CONDITION/d'
    " execute ':g/REGULATION:\ A\.562\ IMO\ Resolution/d'
    " This is to reduce multiple blank lines into one.
    execute ':%s/\s\+$//e'
    execute ':%s/\n\{3,}/\r\r/e'
    " ---------------------
endfunction
How could a function like this be defined in the Sublime Text editor, if it can be done at all, and then called to act on the currently open file?
Here are resources to write Sublime Text 2 plugins:
Sublime Text 2 API Reference
Sublime Text 2 Plugin Examples
How to run Sublime Text 2 commands
Setting up Sublime Text 2 Custom Keyboard Shortcuts
Example: you can write a similar plugin and bind a hotkey to its command, here batch_edit (a sample keybinding follows the script). Then you can open a file and execute the command via that hotkey. By the way, this script does not take the file encoding into account; you can get the file encoding via self.view.encoding().
# -*- coding: utf-8 -*-
import sublime, sublime_plugin
import re

class BatchEditCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        self._edit = edit
        self._replace_all(r"\?", "-")
        self._replace_all(u"Ä", "-")
        self._delete_line_with(r"Version of Light Ship")
        self._delete_line_with(r"Version of Data for Specific Regulations")
        self._replace_all(r"(\n\s*\n)+", "\n\n")

    def _get_file_content(self):
        return self.view.substr(sublime.Region(0, self.view.size()))

    def _update_file(self, doc):
        self.view.replace(self._edit, sublime.Region(0, self.view.size()), doc)

    def _replace_all(self, regex, replacement):
        doc = self._get_file_content()
        p = re.compile(regex, re.UNICODE)
        doc = re.sub(p, replacement, doc)
        self._update_file(doc)

    def _delete_line_with(self, regex):
        doc = self._get_file_content()
        lines = doc.splitlines()
        result = []
        for line in lines:
            if re.search(regex, line, re.UNICODE):
                continue
            result.append(line)
        line_ending = {
            "Windows": "\r\n",
            "Unix": "\n",
            "CR": "\r"
        }[self.view.line_endings()]
        doc = line_ending.join(result)
        self._update_file(doc)
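For example, a minimal user keybinding entry that runs the command could look like the following (the key combination is an arbitrary choice, adjust it to taste):

// Preferences > Key Bindings, user keymap:
{ "keys": ["ctrl+alt+b"], "command": "batch_edit" }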
I want to load multiple CSV files matching certain names into a dataframe. Currently I am looping through the whole folder, creating a list of filenames, loading those CSVs into a list of dataframes, and then concatenating them.
The approach I want to use (if possible) is to bypass all that code and read all the files in a one-liner kind of approach.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
├── Subfolder1
│   └── Subfolder 2
│       ├── X01.csv
│       ├── Y01.csv
│       └── Z01.csv
├── Subfolder3
│   └── Subfolder4
│       ├── X01.csv
│       └── Y01.csv
└── Subfolder5
    ├── X01.csv
    └── Y01.csv
I want to read all the "X01.csv" files while reading from the Root Folder.
Is there a way I can read all the required files with code something like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFileLookup", "true").option("header", "true").load(filepath)
This code works fine for a single level of subfolders; is there any equivalent of this for multi-level folders? I thought the recursiveFileLookup option would look across all levels of subfolders, but apparently that is not the way it works.
Currently I am getting a
Path not found ... filepath
exception. Any help, please.
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to the spark.read.csv function.
For example, I recreated the folder structure from your example inside a Google Colab environment.
To get a list of all CSV files matching the criteria you specified, you can use the following code:
import glob

rootpath = './Root Folder/'

# The following line looks through all files inside rootpath
# recursively, trying to match the pattern specified. In this
# case, it finds any CSV file whose name starts with the letter
# X, Y, or Z, followed by two digits (0 to 9).
glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True)

# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
#  './Root Folder/Subfolder5/X01.csv',
#  './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
#  './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
#  './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv's ability to read a list of files to get the answer you're looking for:

import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns, like:
glob.glob(rootpath + "**/*.csv", recursive=True)
to return a list of every CSV file inside any subdirectory of rootpath.
Additionally, to consider only the files sitting directly inside the root folder (no recursion), you could use something like:
glob.glob(rootpath + "*.csv")
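If you prefer the standard library's Path objects, here is an equivalent sketch using pathlib (an alternative added for illustration; rglob handles the recursion the way "**" with recursive=True does for glob):

from pathlib import Path

rootpath = Path('./Root Folder')
# rglob descends into every subdirectory, so this finds all the
# X01.csv files regardless of nesting depth.
files = [str(p) for p in rootpath.rglob('X01.csv')]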
Edit
Based on your comments to this answer, does something like this work on Databricks?

from notebookutils import mssparkutils as ms
from py4j.protocol import Py4JJavaError  # raised by ms.fs.ls on a bad path

# Databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls
def scan_dir(
    initial_path: str,
    search_str: str,
    account_name: str = '',
):
    """Scan a directory and subdirectories for a string.

    Parameters
    ----------
    initial_path : str
        The path to start the search. Accepts either a valid container name,
        or the entire connection string.
    search_str : str
        The string to search.
    account_name : str
        The name of the account used to access the container folders.
        This value is only used when `initial_path` doesn't conform
        with the format: "abfss://<initial_path>#<account_name>.dfs.core.windows.net/"

    Raises
    ------
    FileNotFoundError
        If the `initial_path` informed doesn't exist.
    ValueError
        If `initial_path` is not a string.
    """
    if not isinstance(initial_path, str):
        raise ValueError(
            f'`initial_path` needs to be of type string, not {type(initial_path)}'
        )
    elif not initial_path.startswith('abfss'):
        initial_path = f'abfss://{initial_path}#{account_name}.dfs.core.windows.net/'

    try:
        fdirs = ms.fs.ls(initial_path)
    except Py4JJavaError as exc:
        raise FileNotFoundError(
            f'The path you informed "{initial_path}" doesn\'t exist'
        ) from exc

    found = []
    for path in fdirs:
        p = path.path
        if path.isDir:
            # Recurse into the subdirectory; the account name is not
            # needed here because p already starts with "abfss".
            found = [*found, *scan_dir(p, search_str)]
        if search_str.lower() in path.name.lower():
            found = [*found, p.replace(path.name, "")]
    return list(set(found))
Example:

# Change .parquet to .csv to search for CSV files instead
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))

The method above worked on Azure Synapse.
I am writing a very simple Python script to read a CSV (no problem) and to write to another CSV (the issue):
System info:
Windows 10
PowerShell
Python 3.6.5 :: Anaconda, Inc.
Sample data: Office Events
The purpose is to filter events based on criteria and to write them to another CSV.
For example:
I would like to read from this CSV and write only the events where Registrations (column 4) is greater than 0 (i.e. remove rows with registrations = 0).
# SCRIPT TO FILTER EVENTS TO BE PROCESSED
import os
import time
import shutil
import os.path
import fnmatch
import csv
import glob
import pandas

# Location of file containing ALL events
path = r'allEvents.csv'

# Writes to writer
writer = csv.writer(open(r'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv', "wb"))
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
#writer.writerow([r'Event Name', r'Start Date', r'End Date', r'Registrations', r'Total Revenue', r'ID', r'Status'])
#writer.writerow([b'Event Name', b'Start Date', b'End Date', b'Registrations', b'Total Revenue', b'ID', b'Status'])

def checkRegistrations(file):
    reader = csv.reader(file)
    data = list(reader)
    for row in data:
        #if row[3] > str(0):
        if row[3] > int(0):
            writer.writerow(([data]))
The error I keep getting is:
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
TypeError: a bytes-like object is required, not 'str'
I have tried the various commented-out statements, for example:
"" vs r"" vs r'' vs b''
if row[3] > int(0) vs if row[3] > str(0)
Every time I execute my script it creates the file, so the first csv.writer line works (it creates and opens the file); the second line (writing the headers) is where the error appears.
Perhaps I am getting mixed up with syntax due to Python versions, or perhaps I am misusing the csv library, or (more than likely) I have endless amounts to learn about data types, IO, and conversion. Someone please help!
I am aware of the excess of imported libraries; the script came from another basic script that moved files from one location to another based on filename and printed a row count for each file being moved. With that being said, I may also be unaware of any missing libraries that are needed.
Please let me know if you have any questions, concerns, or clarifications.
Thanks in advance!
It looks like you are calling:
writer = csv.writer(open('file.csv', 'wb'))
The 'wb' argument is the file mode. The 'b' means you are opening the file in binary mode, but you are then trying to write a string, which isn't what the file expects.
Try getting rid of the 'b' in 'wb':
writer = csv.writer(open('file.csv', 'w'))
Let me know if that works for you.
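On Python 3 it is also worth passing newline='' when opening the output file, as the csv module documentation recommends; otherwise you can get blank rows between records on Windows. Here is a minimal sketch of the whole script with those fixes applied (it assumes the column layout from your question; the filter converts the Registrations column to an integer before comparing, and writes the matching row rather than the whole data list):

import csv
import time

out_name = 'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv'

# Text mode plus newline='' lets the csv module manage line endings itself.
with open(out_name, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["Event Name", "Start Date", "End Date",
                     "Registrations", "Total Revenue", "ID", "Status"])
    with open('allEvents.csv', newline='') as in_file:
        for row in csv.reader(in_file):
            # Skip the header and any blank lines; keep rows whose
            # Registrations column (index 3) is greater than zero.
            if row and row[3].isdigit() and int(row[3]) > 0:
                writer.writerow(row)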
Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV

in_file = CSV.Source('/dir/in.csv')
out_file = CSV.Sink('/dir/out.csv')

for line in CSV.eachline(in_file)
    replace!(line, "None", "")
    CSV.writeline(out_file, line)
end
This is pseudocode; those aren't existing functions.
Idiomatically, should I iterate over 1:CSV.countlines(in_file)? Do a while loop and check something?
If all you want to do is replace a string in each line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"

out = open(outfile, "w+")
for line in readlines(infile)
    newline = replace(line, "a", "b")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the CSV field by field, use the readcsv function in Base:
data = readcsv(infile)
typeof(data)  # Array{Any,2}
This returns the data in the file as a two-dimensional array. You can process this data any way you want, and write it back using the writecsv function:
for i in 1:size(data, 1)  # iterate over rows
    data[i, 1] = "This is " * data[i, 1]  # add text to the first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv
In the Preferences.sublime-settings file there is a setting called ensure_newline_at_eof_on_save, and I'd like to activate it only for non-minified files. It would be nice to pass it a regular expression like /min\.[^\.]+$/ to make it skip those files, but that doesn't seem to work. Any suggestions?
EDIT
This is the contents of the trim_trailing_white_space.py file under the Default packages folder:
import sublime, sublime_plugin

class TrimTrailingWhiteSpace(sublime_plugin.EventListener):
    def on_pre_save(self, view):
        if view.settings().get("trim_trailing_white_space_on_save") == True:
            trailing_white_space = view.find_all("[\t ]+$")
            trailing_white_space.reverse()
            edit = view.begin_edit()
            for r in trailing_white_space:
                view.erase(edit, r)
            view.end_edit(edit)

class EnsureNewlineAtEof(sublime_plugin.EventListener):
    def on_pre_save(self, view):
        if view.settings().get("ensure_newline_at_eof_on_save") == True:
            if view.size() > 0 and view.substr(view.size() - 1) != '\n':
                edit = view.begin_edit()
                view.insert(edit, view.size(), "\n")
                view.end_edit(edit)
I have changed the second class to this implementation:
class EnsureNewlineAtEof(sublime_plugin.EventListener):
    def on_pre_save(self, view):
        if view.settings().get("ensure_newline_at_eof_on_save") == True:
            if ".min." not in view.name():
                if view.size() > 0 and view.substr(view.size() - 1) != '\n':
                    edit = view.begin_edit()
                    view.insert(edit, view.size(), "\n")
                    view.end_edit(edit)
But it still does not work.
This is actually pretty straightforward to do, and you don't even need regex to do it. Select Preferences -> Browse Packages... to open up Sublime's Packages folder in your operating system's file manager. Go into the Default folder and open trim_trailing_white_space.py in Sublime. Line 15 starts the third class definition in the file:
class EnsureNewlineAtEofCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        if self.view.size() > 0 and self.view.substr(self.view.size() - 1) != '\n':
            self.view.insert(edit, self.view.size(), "\n")
Delete the entire class (all four lines), then paste in the following:
class EnsureNewlineAtEofCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        if ".min." not in self.view.name():
            if self.view.size() > 0 and self.view.substr(self.view.size() - 1) != '\n':
                self.view.insert(edit, self.view.size(), "\n")
What I did was add a test to see whether the view's filename contains .min. (as in foobar.min.js or bottom.div.min.css, for example). If the filename contains the pattern, nothing happens. If it doesn't, a newline is added as normal.
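One caveat worth checking: view.name() returns the name assigned to the buffer, which is often empty for ordinary files on disk, so if the test never seems to fire you may have better luck with view.file_name(), which returns the full path (or None for unsaved buffers). A hedged variant of the check:

# view.file_name() gives the on-disk path, e.g. "/css/bottom.div.min.css",
# which is usually what the ".min." test should inspect.
name = self.view.file_name() or ""
if ".min." not in name:
    if self.view.size() > 0 and self.view.substr(self.view.size() - 1) != '\n':
        self.view.insert(edit, self.view.size(), "\n")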
Save the file (do not change the filename!), and restart Sublime for the changes to take effect.
Please note that the above instructions are for Sublime Text 2. ST3 does not have its packages directly available, so you'll need to install PackageResourceViewer from Package Control, select PackageResourceViewer: Open Resource from the Command Palette, then select Default and trim_trailing_white_space.py. Edit the file as above and save it, and the changes should take effect immediately, although you may wish to restart Sublime just for fun.
I have a question concerning some basic transformations in Haskell.
Basically, I have an input file named Input.md. It contains some markdown text that is read in my project file, and I want to write a few functions to transform that text. After applying these functions under a function called convertToHTML, I output the result as an .html file in the correct format.
module Main
  (
    convertToHTML,
    main
  ) where

import System.Environment (getArgs)
import System.IO
import Data.Char (toLower, toUpper)

process :: String -> String
process s = head $ lines s

convertToHTML :: String -> String
convertToHTML str = do
    x <- str
    if (x == '#')
      then "<h1>"
      else return x
--convertToHTML x = map toUpper x

main = do
    args <- getArgs -- command line args
    let (infile, outfile) = (\(x:y:ys) -> (x, y)) args
    putStrLn $ "Input file: " ++ infile
    putStrLn $ "Output file: " ++ outfile
    contents <- readFile infile
    writeFile outfile $ convertToHTML contents
So:
1. How would I read through my input file and transform any line that starts with a # into an HTML tag?
2. How would I read through my input file once more and transform any WORD surrounded by underscores (_word_) into another HTML tag?
3. How would I replace any character with an HTML string?
I tried using functions such as map, filter, and zipWith, but could not figure out how to iterate through the text and transform it. If anybody has any suggestions, please share them. I've been working on this for two days straight and have a bunch of failed code to show for it.
I tried using functions such as map, filter, and zipWith, but could not figure out how to iterate through the text and transform it.
That is because they work on the appropriate collection of elements, and they don't really "iterate"; you simply have to feed them the appropriate data. Let's tackle the # problem as an example.
Our file is one giant String, and what we'd like is to have it nicely split into lines, i.e. [String]. What could do that for us? I have no idea, so let's just search Hoogle for String -> [String].
Ah, there we go: the lines function! Its counterpart, unlines, is also going to be useful. Now we can write our line wrapper:
convertHeader :: String -> String
convertHeader [] = []  -- prevents us from calling head on an empty line
convertHeader x  = if head x == '#'
                     then "<h1>" ++ x ++ "</h1>"
                     else x

and so:

convertHeaders :: String -> String
convertHeaders = unlines . map convertHeader . lines
--               ^String   ^[String]           ^[String] ^String
As you can see, the function first splits the file into lines, maps convertHeader over each line, and then puts the file back together.
See it live on Ideone
Now try doing the same with words to replace your formatting patterns. As a bonus exercise, change convertHeader to count the number of # characters at the front of the line and output <h1>, <h2>, <h3>, and so on accordingly.