How can I process large files in Code Repositories? - palantir-foundry

I have a data feed that delivers a large .txt file (50-75GB) every day. The file contains several different schemas, where each row corresponds to one schema. I would like to split this into partitioned datasets, one for each schema. How can I do this efficiently?

The biggest problem you need to solve is iteration speed while recovering your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file with each of the schemas you want to recover as a line within it, and to add this as a file within your repository. Once you add this file into your repo (alongside your transformation logic), you will be able to read it into a DataFrame, much as you would with the raw files in your dataset, for quick test iteration.
First, make sure you specify txt files as part of your package contents so that your tests will discover them (this is covered in the documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can include the following code. This code does a couple of important things:
- It keeps the raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame, so the way this DataFrame is constructed can be left to the test infrastructure or to the runtime, depending on where you are running.
- Keeping the logic separate ensures the actual row-by-row parsing you want to do is its own testable function, instead of living purely within your my_compute_function.
- It uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This ensures you operate on the parallelized DataFrame, not on a single file read line by line inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    # Row-by-row schema recovery goes here; for now this is a pass-through.
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, without any of the overhead of a full build.
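If it helps, here is a hedged sketch of what raw_parsing_logic could grow into, assuming comma-delimited rows where the number of fields identifies which schema a row belongs to (both assumptions; adapt them to your feed):

from pyspark.sql import functions as F

def raw_parsing_logic(raw_df):
    # Split each raw line on commas and record how many fields it produced.
    split_col = F.split(F.col("value"), ",")
    return raw_df.withColumn("fields", split_col) \
                 .withColumn("field_count", F.size(split_col))

def rows_for_schema(parsed_df, field_count, column_names):
    # Keep only the rows belonging to one schema and give its columns names.
    df = parsed_df.filter(F.col("field_count") == field_count)
    for i, name in enumerate(column_names):
        df = df.withColumn(name, F.col("fields").getItem(i))
    return df.select(*column_names)

For the sample raw.txt above, rows_for_schema(parsed_df, 2, ["my_column", "my_other_column"]) would recover the two-column rows, and a second call with three column names would recover the other schema.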

Related

Palantir Foundry: how to allow a dynamic number of inputs in compute (Code repository)

I have a folder where I will upload one file every month. The file will have the same format in every month.
First problem
The idea is to concatenate all the files in this folder into one file. Currently I am hardcoding the filenames (filename[0], filename[1], filename[2], ...), but imagine that later I will have 50 files: should I explicitly add them all to the transform_df decorator? Is there any other way to handle this?
Second problem:
Currently I have, let's say, 4 files (2021_07, 2021_08, 2021_09, 2021_10), and whenever I add the file containing the 2021_12 data I want to avoid changing the code.
If I add input_5 = Input(path_to_2021_12_do_not_exists), the code will not run and will give an error.
How can I write the code so that it handles future files and ignores an input that does not exist, without manually adding a new value to my code each month?
Thank you
# from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
from functools import reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    input_1=Input(filenames[0]),
    input_2=Input(filenames[1]),
    input_3=Input(filenames[2]),
    input_4=Input(filenames[3]),
)
def compute(input_1, input_2, input_3, input_4):
    input_dfs = [input_1, input_2, input_3, input_4]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))
    dfs = reduce(DataFrame.unionByName, dfs)
    return dfs
This question comes up a lot; the simple answer is that you don't. Defining datasets and executing a build on them are two different steps executed at different stages.
Whenever you commit your code and run the checks, your overall Python code is executed during the renderShrinkwrap stage, except for the compute part. This allows Foundry to discover what datasets exist and publish them.
Publishing involves creating your dataset and writing whatever is inside your compute function into the jobspec of the dataset, so Foundry knows what code to execute whenever you run a build.
Once you hit build on the dataset, Foundry will only pick up whatever is on the jobspec and execute it. Any other code has already run during your checks, and it ran just once.
So any dynamic input/output would require you to re-run checks on your repo, which means some code change would have had to happen, since checks are part of the CI process, not part of the build.
Taking a step back, assuming each of your input files has the same schema, Foundry would expect you to have all of those files in the same dataset as append transactions.
This might not be possible though if, for instance, the only indication of the "year" of the data is embedded in the filename; but your sample code suggests that you expect all these datasets to have the same schema and to union together easily.
You can do this manually through the Dataset Preview - just use the Upload File button or drag-and-drop the new file into the Preview window - or, if it's an "end user" workflow, with a File Upload Widget in a Workshop app. You may need to coordinate with your Foundry support team if this widget isn't available.
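If you do go that route, a rough sketch of processing every file in that single dataset might look like the following; it mirrors the filesystem/ls pattern from the first answer above, and the dataset paths, the *.csv glob, and the header option are placeholders for your own layout:

from transforms.api import transform, Input, Output
from pyspark.sql import functions as F

@transform(
    my_output=Output("/Company/Project_name/Data/clean/File_concat"),
    my_input=Input("/Company/Project_name/Data/raw/monthly_files"),  # hypothetical dataset holding every monthly upload
)
def concat_all_files(my_input, my_output, ctx):
    fs = my_input.filesystem()
    all_df = None
    for file_status in fs.ls('**/*.csv'):
        df = ctx.spark_session.read.csv(fs.hadoop_path + "/" + file_status.path, header=True)
        # Keep the filename around, since the month is only encoded there.
        df = df.withColumn("source_file", F.lit(file_status.path))
        all_df = df if all_df is None else all_df.unionByName(df)
    my_output.write_dataframe(all_df)

New files appended to that input dataset are then picked up on the next build without any code change.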
A bit late to the post, but for anyone who is interested in an answer to most of the question: dynamically determining file names from within a folder is not doable, although some level of dynamic input is possible, as follows:
# from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
# from functools import reduce
from transforms.verbs.dataframes import union_many  # use this instead of reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]
inputs = {'input{}'.format(index): Input(filename) for (index, filename) in enumerate(filenames)}


@transform(
    output=Output("/Company/Project_name/Data/clean/File_concat"),
    **inputs
)
def compute(output, **kwargs):
    # Extract dataframes from input datasets
    input_dfs = [dataset_df.dataframe() for dataset_name, dataset_df in kwargs.items()]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    # dfs = reduce(DataFrame.unionByName, dfs)
    unioned_dfs = union_many(*dfs)
    # With the transform decorator, the output is written explicitly rather than returned.
    output.write_dataframe(unioned_dfs)
A couple of points:
- Created a dynamic input dict.
- That dict is read into the transform using **kwargs.
- Using the transform decorator rather than transform_df, we can extract the dataframes and write the result with output.write_dataframe.
- (Not in the question) Combine multiple dataframes using the union_many function from the transforms.verbs library.

Psychopy: how to avoid storing variables in the csv file?

When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these, there are some variables I would like NOT to be included. There are some variables which I decided to include in the CSV, but many others automatically ended up in it.
Is there a way to manually force (from the code block) the exclusion of some variables from the CSV?
Is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important, and I know I could just create an output file myself without using the one from PsychoPy, or easily clean it afterwards, but I was just curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
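For the "drop it at the analysis stage" route, a minimal sketch with pandas (not part of PsychoPy itself; the file and column names are placeholders) could be:

import pandas as pd

df = pd.read_csv("participant_01.csv")
df = df.drop(columns=["frameRate", "expName"], errors="ignore")            # columns you don't care about
df = df[["participant", "trials.thisN", "key_resp.keys", "key_resp.rt"]]   # desired order
df.to_csv("participant_01_clean.csv", index=False)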
There is a special block called session_variable_order: [var1, var2, var3] in the experiment_config.yaml file, which you probably should be using; also, consider these methods:
from psychopy import data
data.ExperimentHandler.saveAsWideText(fileName = 'exp_handler.csv', delim='\t', sortColumns = False, encoding = 'utf-8')
data.TrialHandler.saveAsText(fileName = 'trial_handler.txt', delim=',', encoding = 'utf-8', dataOut = ('n', 'all_mean', 'all_raw'), summarised = False)
Notice the sortColumns and dataOut params.

Store json-like hierarchical data as nested directory tree?

TLDR
I am looking for an existing convention to encode / serialize tree-like data in a directory structure, split into small files instead of one big file.
Background
There are different scenarios where we want to store tree-like data in a file, which can then be tracked in git. Json files can express dependencies for a package manager (e.g. composer for php, npm for node.js). Yml files can define routes, test cases, etc.
Typically a "tree structure" is a combination of key-value lists and "serial" lists, where each value can again be a tree structure.
Very often the order of associative keys is irrelevant, and should ideally be normalized to alphabetic order.
One problem when storing a big tree structure in a single file, be it json or yml, which is then tracked with git, is that you get plenty of merge conflicts if different branches add and remove entries in the same key-value list.
Especially for key-value lists where the order is irrelevant, it would be more git-friendly to store each sub-tree in a separate file or directory, instead of storing them all in one big file.
Technically it should be possible to create a directory structure that is as expressive as json or yml.
Performance concerns can be overcome with caching. If the files are going to be tracked in git, we can assume they are going to be unchanged most of the time.
The main challenges:
- How to deal with "special characters" that cause problems in some or most file systems, if used in a file or directory name?
- If I need to encode or disambiguate special characters, how can I still keep it pleasant to the eye?
- How to deal with limitations to file name length in some file systems?
- How to deal with other file system quirks, e.g. case insensitivity? Is this even still a thing?
- How to express serial lists, which might contain key-value lists as children? Serial lists cannot be expressed as directories, so their children have to live within the same file.
- How can I avoid reinventing the wheel, creating my own made-up "convention" that nobody else uses?
Desired features:
- As expressive as json or yml.
- git-friendly.
- Machine-readable and -writable.
- Human-readable and -editable, perhaps with limitations.
- Ideally it should use known formats (json, yml) for structures and values that are expressed within a single file.
Naive approach
Of course the first idea would be to use yml files for literal values and serial lists, and directories for key-value lists (in cases where the order does not matter). In a key-value list, the file or directory names are interpreted as keys, the files and subdirectories as values.
This has some limitations, because not every possible key that would be valid in json or yml is also a valid file name in every file system. The most obvious example would be a slash.
Question
I have different ideas how I would do this myself.
But I am really looking for some kind of convention for this that already exists.
Related questions
Persistence: Data Trees stored as Directory Trees
This is asking about performance, and about using the filesystem like a database - I think.
I am less interested in performance (caching makes it irrelevant), and more about the actual storage format / convention.
The closest thing I can think of that could be seen as some kind of convention for doing this are Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/ instead of having a single file /etc/exim/exim4.conf. There are multiple reasons doing this:
Some configuration may be provided by the package manager (e.g. linking to other services that are installed via package manager), while other parts are user-defined. Since there would be a conflict if the user edits a file provided by the package manager, they can instead create a new file and enter additional configuration there.
For large configuration files (like for exim4), it's easier to navigate the configuration if you have multiple files for different concerns (hardcore vim users might disagree).
You can enable / disable parts of the configuration easier by renaming / moving the file that contains a certain part.
We can learn a bit from this: Separation into distinct files should happen if the semantics of the content are orthogonal, i.e. the semantics of one file do not depend on the semantics of another file. This is of course a rule for sibling files; we cannot really deduce rules for serializing a tree structure as a directory tree from it. However, we can definitely see reasons for not splitting every value into its own file.
You mention problems of encoding special characters into a file name. You will only have this problem if you go against conventions! The implicit convention on file and directory names is that they act as locator / ID for files, never as content. Again, we can learn a bit from Linux config files: Usually, there is a master file that contains an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files. The path to those files is irrelevant for the semantics of their content. Technically, we can do something similar with YAML.
Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):
spam:
  spam: spam
  egg: sausage
baked beans:
- spam
- spam
- bacon
A possible transformation would be this (read stuff ending with / as directory, : starts file content):
confdir/
  main.yaml:
    spam: !include spammap/main.yaml
    baked beans: !include beans/
  spammap/
    main.yaml:
      spam: !include spam.yaml
      egg: !include egg.yaml
    spam.yaml:
      spam
    egg.yaml:
      sausage
  beans/
    1.yaml:
      spam
    2.yaml:
      spam
    3.yaml:
      bacon
(In YAML, !include is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as single document.)
As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: A reference to a file will load the content of that file; a reference to a directory will generate a sequence where each item's value is the content of one file in that directory, sorted by file name. As you can see, the file and directory names are never part of the content, sometimes I opted to name them differently (e.g. baked beans -> beans/) to avoid possible file system problems (spaces in filenames in this case – usually not a serious problem nowadays). Also, I adhere to the filename extension convention (having the files carry .yaml). This would be more quirky if you put content into the file names.
I named the starting file on each level main.yaml (not needed in beans/ since it's a sequence). While the exact name is arbitrary, this is a convention used in several other tools, e.g. Python with __init__.py or the Nix package manager with default.nix. Then I placed additional files or directories besides this main file.
Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tags functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}.
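As a rough sketch of that JSON preprocessing idea (the {"!include": "path"} marker is only the convention assumed here, not a standard), a loader could walk the parsed structure and splice in the referenced files:

import json
import os

def resolve_includes(value, basedir):
    # Recursively replace {"!include": "path"} markers with the parsed
    # content of the referenced JSON file (hypothetical marker convention).
    if isinstance(value, dict):
        if set(value.keys()) == {"!include"}:
            path = os.path.join(basedir, value["!include"])
            with open(path) as f:
                return resolve_includes(json.load(f), os.path.dirname(path))
        return {k: resolve_includes(v, basedir) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_includes(v, basedir) for v in value]
    return value

# usage:
# with open("confdir/main.json") as f:
#     root = resolve_includes(json.load(f), "confdir")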
To sum up: while there is no convention that directly does what you want, parts of the problem have been solved in different places, and you can inherit wisdom from that.
Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names will be ascending numbers, and there is no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while staying transparent to the user (data can be accessed like a normal dict structure). It remembers which file each value was loaded from and stores back to those files again.
import os.path
from pathlib import Path

import yaml
from yaml.reader import Reader
from yaml.scanner import Scanner
from yaml.parser import Parser
from yaml.composer import Composer
from yaml.constructor import SafeConstructor
from yaml.resolver import Resolver
from yaml.emitter import Emitter
from yaml.serializer import Serializer
from yaml.representer import SafeRepresenter


class SplitValue(object):
    """This is a value that should be written into its own YAML file."""

    def __init__(self, content, path=None):
        self._content = content
        self._path = path

    def getval(self):
        return self._content

    def setval(self, value):
        self._content = value

    def __repr__(self):
        return self._content.__repr__()


class TransparentContainer(object):
    """Makes SplitValues transparent to the user."""

    def __getitem__(self, key):
        val = super(TransparentContainer, self).__getitem__(key)
        return val.getval() if isinstance(val, SplitValue) else val

    def __setitem__(self, key, value):
        val = super(TransparentContainer, self).__getitem__(key)
        if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
            val.setval(value)
        else:
            super(TransparentContainer, self).__setitem__(key, value)


class TransparentList(TransparentContainer, list):
    pass


class TransparentDict(TransparentContainer, dict):
    pass


class DirectoryAwareFileProcessor(object):
    def __init__(self, path, mode):
        self._basedir = os.path.dirname(path)
        self._file = open(path, mode)

    def close(self):
        try:
            self._file.close()
        finally:
            self.dispose()  # implemented by PyYAML

    # __enter__ / __exit__ to use this in a `with` construct
    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.close()


class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
                       Parser, Composer, SafeConstructor, Resolver):
    """Loads YAML file from a directory structure."""

    def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'r')
        Reader.__init__(self, self._file)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        SafeConstructor.__init__(self)
        Resolver.__init__(self)


def split_value_constructor(loader, node):
    path = loader.construct_scalar(node)
    with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
        return SplitValue(childLoader.get_single_data(), path)

FilesystemLoader.add_constructor(u'!include', split_value_constructor)


def transp_dict_constructor(loader, node):
    ret = TransparentDict()
    ret.update(loader.construct_mapping(node, deep=True))
    return ret

# override constructor for !!map, the default resolved tag for mappings
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
                                 transp_dict_constructor)


def transp_list_constructor(loader, node):
    ret = TransparentList()
    ret.extend(loader.construct_sequence(node, deep=True))
    return ret

# like above, for !!seq
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
                                 transp_list_constructor)


class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
                       Serializer, SafeRepresenter, Resolver):
    def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'w')
        Emitter.__init__(self, self._file)
        Serializer.__init__(self)
        SafeRepresenter.__init__(self)
        Resolver.__init__(self)
        self.__next_unique_name = 1
        Serializer.open(self)

    def gen_unique_name(self):
        val = self.__next_unique_name
        self.__next_unique_name = self.__next_unique_name + 1
        return str(val)

    def close(self):
        try:
            Serializer.close(self)
        finally:
            DirectoryAwareFileProcessor.close(self)


def split_value_representer(dumper, data):
    if data._path is None:
        if isinstance(data._content, TransparentContainer):
            data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
        else:
            data._path = dumper.gen_unique_name() + ".yaml"
    # create the child file's directory relative to the current dump location
    Path(os.path.join(dumper._basedir, os.path.dirname(data._path))).mkdir(exist_ok=True)
    with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
        childDumper.represent(data._content)
    return dumper.represent_scalar(u'!include', data._path)

yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)


def transp_dict_representer(dumper, data):
    return dumper.represent_dict(data)

yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)


def transp_list_representer(dumper, data):
    return dumper.represent_list(data)

yaml.add_representer(TransparentList, transp_list_representer, FilesystemDumper)


# example usage:

# explicitly specify values that should be split.
myData = TransparentDict({
    "spam": SplitValue({
        "spam": SplitValue("spam", "spam.yaml"),
        "egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})

with FilesystemDumper("root.yaml") as dumper:
    dumper.represent(myData)

# load values from stored files.
# The loaded data remembers which values have been in which files.
with FilesystemLoader("root.yaml") as loader:
    loaded = loader.get_single_data()

# modify a value as if it was a normal structure.
# actually updates a SplitValue
loaded["spam"]["spam"] = "baked beans"

# dumps the same structure as before, with the modified value.
with FilesystemDumper("root.yaml") as dumper:
    dumper.represent(loaded)

How to load image from csv file in tensorflow

I have images saved in 0.csv files.
The format is as in the picture below.
How can I read it into TensorFlow?
Thanks!
You should use the Dataset input pipeline introduced in tensorflow 1.4:
https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Here's the example from the developers guide (though you'll want to read through that guide, it's quite well written):
filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
The Dataset preprocessing pipeline has a few nice advantages. Most of the functionality you'll need such as reading text records, shuffling, batching, etc. are reduced to one-liners. More importantly though, it forces you into writing your preprocessing pipeline in a good, modular, testable way. It takes a little bit to get used to the API, but it's time well spent.
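To go from text lines to image tensors, a rough follow-on sketch (TF 1.x API, matching the guide above) could map a CSV parser over the dataset. It assumes each row of 0.csv is "label,pix0,pix1,...,pix783", i.e. a 28x28 grayscale image flattened into 784 pixel columns; adjust record_defaults and the reshape to your actual layout:

import tensorflow as tf

NUM_PIXELS = 28 * 28

def parse_line(line):
    record_defaults = [[0]] + [[0.0]] * NUM_PIXELS   # label is int, pixels are float
    fields = tf.decode_csv(line, record_defaults)
    label = fields[0]
    image = tf.reshape(tf.stack(fields[1:]), [28, 28, 1])
    return image, label

dataset = tf.data.TextLineDataset("0.csv").skip(1)   # skip(1) only if there is a header row
dataset = dataset.map(parse_line).batch(32)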

How to upload multiple JSON files into CouchDB

I am new to CouchDB. I need to get 60 or more JSON files in a minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope someone can help me with my requirement.
If possible, can someone help me with pseudo code?
My idea:
Write a Python script that uploads all the JSON files to CouchDB. Each JSON file must become its own document, and the data present in the JSON must be inserted into CouchDB unchanged (the specified format with values in a file).
Note:
These JSON files are transactional; one file is generated every second, so I need to read each file and upload it to CouchDB in the same format, and on successful upload archive the file into a different folder on the local system.
Python program to parse the JSON and insert it into CouchDB:
import sys
import glob
import errno, time, os
import couchdb, simplejson
import json
from pprint import pprint

couch = couchdb.Server()  # Assuming localhost:5984
# couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
# dirPath = 'C:/Users/VijayKumar/Desktop/CouchDB_Python'
files = glob.glob(path)

for file1 in files:
    # dirs = os.listdir(dirPath)
    file2 = glob.glob(file1)
    for name in file2:  # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
        try:
            with open(name) as f:  # No need to specify 'r': this is the default.
                # sys.stdout.write(f.read())
                json_data = f
                data = json.load(json_data)
                db.save(data)
                pprint(data)
                json_data.close()
                # time.sleep(2)
        except IOError as exc:
            if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
                raise  # Propagate other kinds of IOError.
I would use the CouchDB bulk API, even though you have specified that you need to send the documents to the db one by one. For example, implementing a simple queue that gets flushed every, say, 5-10 seconds via a bulk doc call will greatly increase the performance of your application.
There is obviously a quirk in that: you need to know the IDs of the docs that you want to get from the DB. But for the PUTs it is perfect. (That is not entirely true; you can get ranges of docs using a bulk operation if the IDs you are using for your docs can be sorted nicely.)
From my experience working with CouchDB, I have a hunch that you are dealing with transactional documents in order to compile them into some sort of summed result and act on that data accordingly (maybe creating the next transactional doc in the series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get a reduce function working properly, and it is highly dependent on what you actually want to achieve and what data you are prepared to emit from the view, so I can't really provide you with more detail on that.
So in the end the app logic would go something like this:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout:
    send all in transaction queue
If I got that first part about why you are using transactional docs wrong, all that would really change is the part of my app logic where you get those transactional docs.
Also, before writing your own 'reduce' function, have a look at the built-in ones (they are a lot faster than anything outside of the db engine can do).
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
EDIT:
Since you are starting out, I strongly recommend having a look at the CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden stone (well, maybe not so much a hidden stone as something that is not obvious to look out for as a newcomer). When you write a reduce function, make sure it does not produce too much output for an unbounded query. This will slow the entire view down dramatically, even when you pass reduce=false when getting stuff from it.
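As a rough sketch of the queue-plus-periodic-bulk-flush idea, using the standard /{db}/_bulk_docs endpoint directly via requests (the URL, database name, flush interval, and the get_next_json_doc helper are all placeholders for your own setup):

import json
import time
import requests

COUCH_URL = "http://localhost:5984"
DB = "mydb"
FLUSH_INTERVAL = 5  # seconds

def flush(docs):
    # Send every queued document in a single _bulk_docs request.
    if not docs:
        return
    resp = requests.post(
        COUCH_URL + "/" + DB + "/_bulk_docs",
        data=json.dumps({"docs": docs}),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    docs.clear()

queue = []
last_flush = time.time()
while True:
    doc = get_next_json_doc()  # hypothetical: however you receive each JSON file
    if doc is not None:
        queue.append(doc)
    if time.time() - last_flush >= FLUSH_INTERVAL:
        flush(queue)
        last_flush = time.time()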
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
    get new JSON doc from server
    send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
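A minimal sketch of that per-document upload with requests (assuming a local CouchDB with a database called mydb and no authentication, which you would adjust for your setup):

import json
import requests

with open("doc_0001.json") as f:   # placeholder filename
    doc = json.load(f)

# POST to the database creates one document with a server-generated id.
resp = requests.post("http://localhost:5984/mydb", json=doc)
resp.raise_for_status()
print(resp.json())   # e.g. {'ok': True, 'id': '...', 'rev': '...'}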
You might want to check out the pycouchdb module for Python 3. I've used it myself to upload lots of JSON objects into a CouchDB instance. My project does pretty much the same as you describe, so you can take a look at my project Pyro on GitHub for details.
My class looks like this:
import sys

import pycouchdb


class MyCouch:
    """ COMMUNICATES WITH COUCHDB SERVER """

    def __init__(self, server, port, user, password, database):
        # ESTABLISHING CONNECTION
        self.server = pycouchdb.Server("http://" + user + ":" + password + "@" + server + ":" + port + "/")
        self.db = self.server.database(database)

    def check_doc_rev(self, doc_id):
        # CHECKS REVISION OF SUPPLIED DOCUMENT
        try:
            rev = self.db.get(doc_id)
            return rev["_rev"]
        except Exception as inst:
            return -1

    def update(self, all_computers):
        # UPDATES DATABASE WITH JSON STRING
        try:
            result = self.db.save_bulk(all_computers, transaction=False)
            sys.stdout.write(" Updating database")
            sys.stdout.flush()
            return result
        except Exception as ex:
            sys.stdout.write("Updating database")
            sys.stdout.write("Exception: ")
            print(ex)
            sys.stdout.flush()
            return None
Let me know in case of any questions - I will be more than glad to help if you find some of my code usable.
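For context, hypothetical usage of the class above would look roughly like this (host, credentials and the docs list are placeholders):

couch = MyCouch("localhost", "5984", "admin", "secret", "mydb")
result = couch.update(list_of_json_docs)   # list of dicts loaded from your JSON files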