I can't find an option or workaround for this using pyarrow.csv.read_csv, and there are many other reasons why using pandas doesn't work for us.
We have csv files with final columns that are effectively optional, and the source data doesn't always include empty cells for them, for example:
name,date,serial_number,prior_name,comments
A,2021-01-01,1234
B,2021-01-02,1235,A,Name changed for new version
C,2021-01-02,1236,B
This fails with an error like pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 3:
I've got to assume that pyarrow can handle this, but I can't see how. Even the invalid row handler doesn't appear to let me return the "appropriate" value, only to "skip" these rows. That would even be okay if I could save them and append later, but as arrow tables are immutable, it just seems like there should be a more straightforward way to handle these cases.
As Pace noted, this is unfortunately not presently available in pyarrow. We actually process the "csv" files on the fly after extracting from a zip, and so creating intermediate files wasn't an option.
For anyone else needing a quick-ish way to handle this, I was able to get around it (and a couple of other issues) by creating a class to wrap the stream and override read(size, *args, **kwargs) to perform the stripping on the fly. Even with the middleman class, it's faster than attempting to load in pandas (and there are several other reasons why we aren't using pandas here).
Here's a template example:
class StreamWrapper():
    def __init__(self, obj=None):
        self.delegation = obj
        self.header = None

    def __getattr__(self, *args, **kwargs):
        # any other call is delegated to the stream/object
        return self.delegation.__getattribute__(*args, **kwargs)

    def read(self, size=None, *args, **kwargs):
        bytedata = self.delegation.read(size, *args, **kwargs)
        # .. the rest of the logic pre-processes the byte data,
        # identifies the header and number of columns (which are retained persistently),
        # and then strips out extra columns when encountered
        return bytedata
This allows our call to be:
df = pyarrow.csv.read_csv(
    StreamWrapper(zipfile_stream),
    parse_options=...
)
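For reference, here is a minimal sketch of the kind of preprocessing the read() override can do. This version pads short rows with extra delimiters so pyarrow always sees the expected column count; the chunk-boundary buffering is simplified, it assumes \n line endings and that each read covers at least one full line, and the class and attribute names are just illustrative (our real logic also strips extra columns and handles a few other quirks):

class PaddingStreamWrapper(StreamWrapper):
    def __init__(self, obj=None, delimiter=b","):
        super().__init__(obj)
        self.delimiter = delimiter
        self._tail = b""      # possibly incomplete trailing line from the last chunk
        self._n_cols = None   # learned from the header line

    def read(self, size=None, *args, **kwargs):
        chunk = self.delegation.read(size, *args, **kwargs)
        bytedata = self._tail + chunk
        if chunk:
            # keep a possibly incomplete trailing line for the next call
            lines = bytedata.split(b"\n")
            self._tail, lines = lines[-1], lines[:-1]
        else:
            # underlying stream exhausted: flush whatever is left
            self._tail = b""
            lines = bytedata.split(b"\n") if bytedata else []
        fixed = []
        for line in lines:
            if not line:
                continue
            if self._n_cols is None:
                self._n_cols = line.count(self.delimiter) + 1  # header defines the width
            missing = self._n_cols - (line.count(self.delimiter) + 1)
            fixed.append(line + self.delimiter * max(missing, 0))
        return b"\n".join(fixed) + b"\n" if fixed else b""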
I have a data feed that gives a large .txt file (50-75GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets for each schema, how can I do this efficiently?
The largest problem you need to solve is the iteration speed to recover your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file with each of the schemas you want to recover as a line within it, and to add this as a file within your repository. When you add this file into your repo (alongside your transformation logic), you will then be able to push it into a dataframe, much as you would with the raw files in your dataset, for quick testing iteration.
First, make sure you specify txt files as a part of your package contents so that your tests will discover them (this is covered in the documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can then include the following code. This code does a couple of important things:
- It keeps the raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame, so that the way the DataFrame is constructed can be left to the test infrastructure or to the runtime, depending on where you are running.
- Keeping the logic separate ensures the actual row-by-row parsing you want to do is its own testable function, instead of living purely inside your my_compute_function.
- It uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This ensures the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so that you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, but without any of the overhead of a full build.
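If it helps as a starting point, here is a hedged sketch of what raw_parsing_logic might grow into for a multi-schema file. It assumes each schema can be recognized by the number of comma-separated fields on a line; the delimiter, the fingerprinting rule, and the column names are assumptions you would adapt to your feed:

from pyspark.sql import functions as F

def raw_parsing_logic(raw_df):
    # raw_df has a single "value" column holding one raw line per row
    fields = F.split(F.col("value"), ",")
    return (
        raw_df
        .withColumn("num_fields", F.size(fields))      # crude schema fingerprint
        .withColumn("first_field", fields.getItem(0))
        .withColumn("second_field", fields.getItem(1))
    )

Downstream, you could filter on num_fields (or a more robust fingerprint) to route each schema to its own output or partition.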
TLDR
I am looking for an existing convention to encode / serialize tree-like data in a directory structure, split into small files instead of one big file.
Background
There are different scenarios where we want to store tree-like data in a file, which can then be tracked in git. Json files can express dependencies for a package manager (e.g. composer for php, npm for node.js). Yml files can define routes, test cases, etc.
Typically a "tree structure" is a combination of key-value lists and "serial" lists, where each value can again be a tree structure.
Very often the order of associative keys is irrelevant, and should ideally be normalized to alphabetic order.
One problem when storing a big tree structure in a single file, be it json or yml, which is then tracked with git, is that you get plenty of merge conflicts if different branches add and remove entries in the same key-value list.
Especially for key-value lists where the order is irrelevant, it would be more git-friendly to store each sub-tree in a separate file or directory, instead of storing them all in one big file.
Technically it should be possible to create a directory structure that is as expressive as json or yml.
Performance concerns can be overcome with caching. If the files are going to be tracked in git, we can assume they are going to be unchanged most of the time.
The main challenges:
- How to deal with "special characters" that cause problems in some or most file systems, if used in a file or directory name?
- If I need to encode or disambiguate special characters, how can I still keep it pleasant to the eye?
- How to deal with limitations to file name length in some file systems?
- How to deal with other file system quirks, e.g. case insensitivity? Is this even still a thing?
- How to express serial lists, which might contain key-value lists as children? Serial lists cannot be expressed as directories, so its children have to live within the same file.
- How can I avoid reinventing the wheel by creating my own made-up "convention" that nobody else uses?
Desired features:
- As expressive as json or yml.
- git-friendly.
- Machine-readable and -writable.
- Human-readable and -editable, perhaps with limitations.
- Ideally it should use known formats (json, yml) for structures and values that are expressed within a single file.
Naive approach
Of course the first idea would be to use yml files for literal values and serial lists, and directories for key-value lists (in cases where the order does not matter). In a key-value list, the file or directory names are interpreted as keys, the files and subdirectories as values.
This has some limitations, because not every possible key that would be valid in json or yml is also a valid file name in every file system. The most obvious example would be a slash.
Question
I have different ideas how I would do this myself.
But I am really looking for some kind of convention for this that already exists.
Related questions
Persistence: Data Trees stored as Directory Trees
This is asking about performance, and about using the filesystem like a database - I think.
I am less interested in performance (caching makes it irrelevant), and more about the actual storage format / convention.
The closest thing I can think of that could be seen as some kind of convention for this is Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/ instead of having a single file /etc/exim4/exim4.conf. There are multiple reasons for doing this:
- Some configuration may be provided by the package manager (e.g. linking to other services that are installed via the package manager), while other parts are user-defined. Since there would be a conflict if the user edited a file provided by the package manager, they can instead create a new file and enter additional configuration there.
- For large configuration files (like for exim4), it's easier to navigate the configuration if you have multiple files for different concerns (hardcore vim users might disagree).
- You can enable / disable parts of the configuration more easily by renaming / moving the file that contains a certain part.
We can learn a bit from this: separation into distinct files should happen if the semantics of the content are orthogonal, i.e. the semantics of one file do not depend on the semantics of another file. This is of course a rule for sibling files; we cannot really deduce rules for serializing a tree structure as a directory tree from it. However, we can definitely see reasons for not splitting every value into its own file.
You mention problems of encoding special characters into a file name. You will only have this problem if you go against conventions! The implicit convention on file and directory names is that they act as locator / ID for files, never as content. Again, we can learn a bit from Linux config files: Usually, there is a master file that contains an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files. The path to those files is irrelevant for the semantics of their content. Technically, we can do something similar with YAML.
Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):
spam:
  spam: spam
  egg: sausage
baked beans:
  - spam
  - spam
  - bacon
A possible transformation would be this (read stuff ending with / as directory, : starts file content):
confdir/
  main.yaml:
    spam: !include spammap/main.yaml
    baked beans: !include beans/
  spammap/
    main.yaml:
      spam: !include spam.yaml
      egg: !include egg.yaml
    spam.yaml:
      spam
    egg.yaml:
      sausage
  beans/
    1.yaml:
      spam
    2.yaml:
      spam
    3.yaml:
      bacon
(In YAML, !include is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as single document.)
As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: A reference to a file will load the content of that file; a reference to a directory will generate a sequence where each item's value is the content of one file in that directory, sorted by file name. As you can see, the file and directory names are never part of the content, sometimes I opted to name them differently (e.g. baked beans -> beans/) to avoid possible file system problems (spaces in filenames in this case – usually not a serious problem nowadays). Also, I adhere to the filename extension convention (having the files carry .yaml). This would be more quirky if you put content into the file names.
I named the starting file on each level main.yaml (not needed in beans/ since it's a sequence). While the exact name is arbitrary, this is a convention used in several other tools, e.g. Python with __init__.py or the Nix package manager with default.nix. Then I placed additional files or directories besides this main file.
Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tags functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}.
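For the JSON variant, here is a rough sketch of that preprocessing; the {"!include": "path"} marker is just the ad-hoc convention suggested above, not an established standard:

import json
import os

def load_json_tree(path):
    # Load a JSON file, recursively replacing {"!include": "relpath"} markers
    # with the content of the referenced file (paths are relative to the including file).
    basedir = os.path.dirname(path)
    with open(path, encoding="utf-8") as f:
        return _resolve(json.load(f), basedir)

def _resolve(node, basedir):
    if isinstance(node, dict):
        if set(node) == {"!include"}:
            return load_json_tree(os.path.join(basedir, node["!include"]))
        return {key: _resolve(value, basedir) for key, value in node.items()}
    if isinstance(node, list):
        return [_resolve(item, basedir) for item in node]
    return node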
To sum up: while there is no ready-made convention for exactly what you want, parts of the problem have been solved in different places, and you can inherit wisdom from that.
Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names will be ascending numbers, no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while being transparent to the user (data can be accessed like a normal dict structure). It remembers file names stuff has been loaded from and stores to those files again.
import os.path
from pathlib import Path
import yaml
from yaml.reader import Reader
from yaml.scanner import Scanner
from yaml.parser import Parser
from yaml.composer import Composer
from yaml.constructor import SafeConstructor
from yaml.resolver import Resolver
from yaml.emitter import Emitter
from yaml.serializer import Serializer
from yaml.representer import SafeRepresenter
class SplitValue(object):
    """This is a value that should be written into its own YAML file."""

    def __init__(self, content, path=None):
        self._content = content
        self._path = path

    def getval(self):
        return self._content

    def setval(self, value):
        self._content = value

    def __repr__(self):
        return self._content.__repr__()


class TransparentContainer(object):
    """Makes SplitValues transparent to the user."""

    def __getitem__(self, key):
        val = super(TransparentContainer, self).__getitem__(key)
        return val.getval() if isinstance(val, SplitValue) else val

    def __setitem__(self, key, value):
        val = super(TransparentContainer, self).__getitem__(key)
        if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
            val.setval(value)
        else:
            super(TransparentContainer, self).__setitem__(key, value)


class TransparentList(TransparentContainer, list):
    pass


class TransparentDict(TransparentContainer, dict):
    pass


class DirectoryAwareFileProcessor(object):
    def __init__(self, path, mode):
        self._basedir = os.path.dirname(path)
        self._file = open(path, mode)

    def close(self):
        try:
            self._file.close()
        finally:
            self.dispose()  # implemented by PyYAML

    # __enter__ / __exit__ to use this in a `with` construct
    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.close()


class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
                       Parser, Composer, SafeConstructor, Resolver):
    """Loads YAML file from a directory structure."""

    def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'r')
        Reader.__init__(self, self._file)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        SafeConstructor.__init__(self)
        Resolver.__init__(self)
def split_value_constructor(loader, node):
    path = loader.construct_scalar(node)
    with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
        return SplitValue(childLoader.get_single_data(), path)

FilesystemLoader.add_constructor(u'!include', split_value_constructor)


def transp_dict_constructor(loader, node):
    ret = TransparentDict()
    ret.update(loader.construct_mapping(node, deep=True))
    return ret

# override constructor for !!map, the default resolved tag for mappings
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
                                 transp_dict_constructor)
def transp_list_constructor(loader, node):
    ret = TransparentList()
    ret.extend(loader.construct_sequence(node, deep=True))
    return ret

# like above, for !!seq
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
                                 transp_list_constructor)
class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
                       Serializer, SafeRepresenter, Resolver):
    def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'w')
        Emitter.__init__(self, self._file)
        Serializer.__init__(self)
        SafeRepresenter.__init__(self)
        Resolver.__init__(self)

        self.__next_unique_name = 1
        Serializer.open(self)

    def gen_unique_name(self):
        val = self.__next_unique_name
        self.__next_unique_name = self.__next_unique_name + 1
        return str(val)

    def close(self):
        try:
            Serializer.close(self)
        finally:
            DirectoryAwareFileProcessor.close(self)


def split_value_representer(dumper, data):
    if data._path is None:
        if isinstance(data._content, TransparentContainer):
            data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
        else:
            data._path = dumper.gen_unique_name() + ".yaml"

    Path(os.path.dirname(data._path)).mkdir(exist_ok=True)
    with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
        childDumper.represent(data._content)
    return dumper.represent_scalar(u'!include', data._path)
yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)
def transp_dict_representer(dumper, data):
    return dumper.represent_dict(data)

yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)


def transp_list_representer(dumper, data):
    return dumper.represent_list(data)

yaml.add_representer(TransparentList, transp_list_representer, FilesystemDumper)
# example usage:
# explicitly specify values that should be split.
myData = TransparentDict({
    "spam": SplitValue({
        "spam": SplitValue("spam", "spam.yaml"),
        "egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})

with FilesystemDumper("root.yaml") as dumper:
    dumper.represent(myData)

# load values from stored files.
# The loaded data remembers which values have been in which files.
with FilesystemLoader("root.yaml") as loader:
    loaded = loader.get_single_data()

# modify a value as if it was a normal structure.
# actually updates a SplitValue
loaded["spam"]["spam"] = "baked beans"

# dumps the same structure as before, with the modified value.
with FilesystemDumper("root.yaml") as dumper:
    dumper.represent(loaded)
I have a 42 GB jsonl file. Every element of this file is a json object. I create training samples from every json object, but the number of training samples I extract from each object can vary between 0 and 5. What is the best way to create a custom PyTorch dataset without reading the entire jsonl file into memory?
This is the dataset I am talking about - Google Natural Questions.
You have a couple of options.
The simplest option, if having lots of small files is not a problem, is to preprocess the jsonl so that each extracted sample ends up in its own small file. Then you can just read the one that corresponds to the requested index. For example:
import numpy as np
from torch.utils.data import Dataset

class SingleFileDataset(Dataset):
    def __init__(self, list_of_file_paths):
        self.list_of_file_paths = list_of_file_paths

    def __getitem__(self, index):
        return np.load(self.list_of_file_paths[index])  # or equivalent reading code for a single file
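As a rough sketch of the preprocessing that option assumes (the extract_samples helper and the .npy-per-sample layout are hypothetical, just to illustrate streaming the jsonl once without holding it in memory):

import json
import numpy as np
from pathlib import Path

def preprocess_jsonl(jsonl_path, out_dir):
    # Stream the jsonl line by line and write each extracted sample to its own file.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sample_paths = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            # extract_samples is your own logic that yields 0-5 samples per json object
            for sample in extract_samples(obj):
                path = out_dir / "{}.npy".format(len(sample_paths))
                np.save(path, sample)  # assumes each sample is array-like; pick another format if not
                sample_paths.append(str(path))
    return sample_paths

The returned list of paths is exactly what SingleFileDataset above expects.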
You can also split the data into a constant number of files, and then calculate, given the index, which file the sample resides in. Then you need to open that file into memory and read the appropriate index. This gives a trade-off between disk access and memory usage. Assume you have n samples split evenly across c files during preprocessing, so that each file holds samples_per_file = n / c samples. To read the sample at index i we would do:
class SplitIntoFilesDataset(Dataset):
    def __init__(self, list_of_file_paths, samples_per_file):
        self.list_of_file_paths = list_of_file_paths
        self.samples_per_file = samples_per_file

    def __getitem__(self, index):
        # index // samples_per_file picks the file,
        # index % samples_per_file is the position within that file
        file_to_load = self.list_of_file_paths[index // self.samples_per_file]
        data = np.load(file_to_load)  # or equivalent reading code
        return data[index % self.samples_per_file]
Finally, you could use an HDF5 file, which allows access to rows on disk. This is possibly the best solution if you have a lot of data, since the data will be close together on disk. There's an implementation here which I have copy-pasted below:
import h5py
import torch
import torch.utils.data as data

class H5Dataset(data.Dataset):
    def __init__(self, file_path):
        super(H5Dataset, self).__init__()
        h5_file = h5py.File(file_path, 'r')
        self.data = h5_file.get('data')
        self.target = h5_file.get('label')

    def __getitem__(self, index):
        return (torch.from_numpy(self.data[index, :, :, :]).float(),
                torch.from_numpy(self.target[index, :, :, :]).float())

    def __len__(self):
        return self.data.shape[0]
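For completeness, a hedged sketch of how the HDF5 file could be built from the jsonl in a single streaming pass; extract_samples and the fixed per-sample shape/dtype are assumptions to adapt:

import json
import h5py
import numpy as np

def build_h5(jsonl_path, h5_path, sample_shape):
    # One pass over the jsonl, appending samples to a resizable HDF5 dataset.
    with h5py.File(h5_path, 'w') as h5_file, open(jsonl_path, 'r', encoding='utf-8') as f:
        dset = h5_file.create_dataset(
            'data', shape=(0,) + sample_shape, maxshape=(None,) + sample_shape,
            dtype='float32', chunks=True)
        for line in f:
            obj = json.loads(line)
            for sample in extract_samples(obj):  # your 0-5 samples per json object
                new_size = dset.shape[0] + 1
                dset.resize(new_size, axis=0)
                dset[new_size - 1] = np.asarray(sample, dtype='float32')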
I'm using boto to access a dynamodb table. Everything was going well until I tried to perform a scan operation.
I've tried a couple of syntaxes I've found after repeated searches of The Internet, but no luck:
def scanAssets(self, asset):
    results = self.table.scan({('asset', 'EQ', asset)})
-or-
    results = self.table.scan(scan_filter={'asset': boto.dynamodb.condition.EQ(asset)})
The attribute I'm scanning for is called 'asset', and asset is a string.
The odd thing is the table.scan call always ends up going through this function:
def dynamize_scan_filter(self, scan_filter):
    """
    Convert a layer2 scan_filter parameter into the
    structure required by Layer1.
    """
    d = None
    if scan_filter:
        d = {}
        for attr_name in scan_filter:
            condition = scan_filter[attr_name]
            d[attr_name] = condition.to_dict()
    return d
I'm not a python expert, but I don't see how this would work. I.e. what kind of structure would scan_filter have to be to get through this code?
Again, maybe I'm just calling it wrong. Any suggestions?
OK, looks like I had an import problem. Simply using:
import boto
and specifying boto.dynamodb.condition doesn't cut it. I had to add:
import dynamodb.condition
to get the condition type to get picked up. My now working code is:
results = self.table.scan(scan_filter={'asset': dynamodb.condition.EQ(asset)})
Not that I completely understand why, but it's working for me now. :-)
Or you can do this:
exclusive_start_key = None
while True:
    result_set = self.table.scan(
        asset__eq=asset,  # The scan filter is explicitly given here
        max_page_size=100,  # Number of entries per page
        limit=100,
        # You can divide the table into n segments so that processing can be done in parallel and quickly.
        total_segments=number_of_segments,
        segment=segment,  # Specify which segment you want to process
        exclusive_start_key=exclusive_start_key  # To start from the last key seen
    )
    dynamodb_items = map(lambda item: item, result_set)
    # Do something with your items, e.g. add them to a list for later processing when you come out of the while loop
    exclusive_start_key = result_set._last_key_seen
    if not exclusive_start_key:
        break
This is applicable to any field.
Segmentation: suppose you have the above script in test.py. You can then run it in parallel like:
python test.py --segment=0 --total_segments=4
python test.py --segment=1 --total_segments=4
python test.py --segment=2 --total_segments=4
python test.py --segment=3 --total_segments=4
in different screens
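If you happen to be on the newer boto3 library rather than legacy boto, the equivalent filtered scan with pagination looks roughly like this (the table name and variable names are placeholders, not from the original code):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource('dynamodb').Table('my_table')

scan_kwargs = {'FilterExpression': Attr('asset').eq(asset)}
while True:
    response = table.scan(**scan_kwargs)
    for item in response['Items']:
        pass  # do something with each matching item
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']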
In Scrapy, I have my items specified in a certain order in items.py, and my spider has those items again in the same order. However, when I run the spider and save the results as a CSV, the column order from items.py or the spider is not maintained. How can I get the CSV to show columns in a specific order? Example code would be very appreciated.
Thanks.
This is related to Modifiying CSV export in scrapy
The problem is that the exporter is instantiated without any keyword parameters, so the keywords like EXPORT_FIELDS are ignored. The solution is the same: you need to subclass the CSV item exporter to pass the keyword parameters.
Following the above recipe, I created a new file xyzzy/feedexport.py (change "xyzzy" to whatever your scrapy class is named):
"""
The standard CSVItemExporter class does not pass the kwargs through to the
CSV writer, resulting in EXPORT_FIELDS and EXPORT_ENCODING being ignored
(EXPORT_EMPTY is not used by CSV).
"""
from scrapy.conf import settings
from scrapy.contrib.exporter import CsvItemExporter
class CSVkwItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['fields_to_export'] = settings.getlist('EXPORT_FIELDS') or None
        kwargs['encoding'] = settings.get('EXPORT_ENCODING', 'utf-8')
        super(CSVkwItemExporter, self).__init__(*args, **kwargs)
and then added it into xyzzy/settings.py:
FEED_EXPORTERS = {
    'csv': 'xyzzy.feedexport.CSVkwItemExporter'
}
Now the CSV exporter will honor the EXPORT_FIELDS setting - also add to xyzzy/settings.py:
# By specifying the fields to export, the CSV export honors the order
# rather than using a random order.
EXPORT_FIELDS = [
    'field1',
    'field2',
    'field3',
]
I don't know whether this was available at the time you asked your question, but Scrapy now provides a fields_to_export attribute on the BaseItemExporter class, from which CsvItemExporter inherits.
As per version 0.22:
fields_to_export
A list with the name of the fields that will be exported, or None if you want to export all fields. Defaults to None.
Some exporters (like CsvItemExporter) respect the order of the fields defined in this attribute.
See also the documentation for BaseItemExporter and CsvItemExporter on the Scrapy website.
In order to use this feature, though, you will have to create your own ItemPipeline, as detailed in this answer
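For what it's worth, in more recent Scrapy releases you can usually get the same effect with a single setting instead of a custom exporter or pipeline; a minimal sketch (the field names are placeholders):

# settings.py
FEED_EXPORT_FIELDS = ['field1', 'field2', 'field3']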