How do I create a lot of sample data for firestore? - json

Let's say I need to create a lot of different documents/collections in Firestore. I need to add them quickly, like copy and paste JSON. I can't do that with the standard Firebase console, because adding 100 documents would take me forever. Are there any solutions to bulk-create mock data with a given structure in a Firestore DB?

If you switch to the Cloud Console (rather than Firebase Console) for your project, you can use Cloud Shell as a starting point.
In the Cloud Shell environment you'll find tools like node and python installed and available. Using whichever one you prefer, you can write a script using the Server Client libraries.
For example in Python:
from google.cloud import firestore
import random

MAX_DOCUMENTS = 100
SAMPLE_COLLECTION_ID = u'users'
SAMPLE_COLORS = [u'Blue', u'Red', u'Green', u'Yellow', u'White', u'Black']

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()
collection_ref = db.collection(SAMPLE_COLLECTION_ID)

# Write MAX_DOCUMENTS documents with randomly chosen color fields
for x in range(MAX_DOCUMENTS):
    collection_ref.add({
        u'primary': random.choice(SAMPLE_COLORS),
        u'secondary': random.choice(SAMPLE_COLORS),
        u'trim': random.choice(SAMPLE_COLORS),
        u'accent': random.choice(SAMPLE_COLORS)
    })
While this is the easiest way to get up and running with a static dataset, it leaves a little to be desired. Namely, with Firestore, live dynamic data is needed to exercise its functionality, such as real-time queries. For this task, using Cloud Scheduler & Cloud Functions is a relatively easy way to regularly update sample data.
In addition to the sample generation code, you'll specify the update frequency in Cloud Scheduler. For instance, */10 * * * * defines a frequency of every 10 minutes using the standard unix-cron format.
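As a minimal sketch of the scheduled side, assuming a Pub/Sub-triggered Cloud Function that Cloud Scheduler invokes on that cron schedule (the entry point name and the "update 10 docs per run" policy are illustrative assumptions, not part of the original answer):
# Sketch of a Pub/Sub-triggered Cloud Function fired by Cloud Scheduler.
import random
from google.cloud import firestore

SAMPLE_COLORS = [u'Blue', u'Red', u'Green', u'Yellow', u'White', u'Black']

db = firestore.Client()

def refresh_sample_data(event, context):
    # Overwrite a handful of documents each run so real-time listeners see changes
    for doc in db.collection(u'users').limit(10).stream():
        doc.reference.update({u'primary': random.choice(SAMPLE_COLORS)})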
For non-static data, often a timestamp is useful. Firestore provides a way to have a timestamp from the database server added at write-time as one of the fields:
u'timestamp': firestore.SERVER_TIMESTAMP
It is worth noting that timestamps like this will hotspot in production systems if not sharded correctly. Typically 500 writes/second to the same collection is the maximum you will want, so that the index doesn't hotspot. Sharding can be as simple as each user having their own collection (500 writes/second per user). However, for this example, writing 100 documents every minute via a scheduled Cloud Function is definitely not an issue.
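As a rough sketch of both ideas together, reusing db and SAMPLE_COLORS from the script above (the per-user subcollection layout and the field names are just one illustrative way to shard, not from the original answer):
# Illustrative only: write a server-side timestamp into a per-user
# subcollection so that high write rates are spread across users.
db.collection(u'users').document(u'some-user-id') \
  .collection(u'events').add({
      u'primary': random.choice(SAMPLE_COLORS),
      u'timestamp': firestore.SERVER_TIMESTAMP
  })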

FireKit is a good resource to use for this purpose. It even allows sub-collections.
https://retroportalstudio.gumroad.com/l/firekit_free

Related

Execution ID on Google Cloud Run

I am wondering if there exists an execution ID in Cloud Run like the one in Google Cloud Functions.
An ID that identifies each invocation separately is very useful with "Show matching entries" in Cloud Logging to get all logs related to an execution.
I understand the execution model is different, since Cloud Run allows concurrency, but is there a workaround to assign each log to a certain execution?
My final need is to group the request and the response on the same line, because as of now I am printing them separately, and if a few requests arrive at the same time, I can't see which response corresponds to which request...
Thank you for your attention!
OpenTelemetry looks like a great solution, but the learning and setup time isn't negligible,
so I'm going with a custom ID created in before_request, stored in Flask's g, and included in every print().
import uuid
from flask import Flask, g

app = Flask(__name__)

@app.before_request
def before_request_func():
    g.execution_id = uuid.uuid4()
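And then, as a sketch of the logging side, reusing app and g from above (the route and message format are purely for illustration):
from flask import request

@app.route('/example')  # hypothetical endpoint
def example():
    print(f"[{g.execution_id}] request:  {request.args.to_dict()}")
    response = "ok"
    print(f"[{g.execution_id}] response: {response}")
    return response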

Data Studio connector making multiple calls to API when it should only be making 1

I'm finalizing a Data Studio connector and noticing some odd behavior with the number of API calls.
Where I'm expecting to see a single API call, I'm seeing multiple calls.
In my Apps Script I'm keeping a simple tally which increments by 1 on every URL fetch, and that gives me the number I expect to see from getData().
However, in my API monitoring logs (using Runscope) I'm seeing multiple API requests for the same endpoint, and varying numbers for different endpoints in a single getData() call (they should all be the same).
I can't post the code here (client project) but it's substantially the same framework as the Data Connector code on Google's docs. I have caching and backoff implemented.
Looking for any ideas or if anyone has experienced something similar?
Thanks
Per this reference, GDS will also perform semantic type detection if you aren't explicitly defining this property for your fields. If the query is for semantic type detection, the request will feature sampleExtraction: true:
When Data Studio executes the getData function of a community connector for the purpose of semantic detection, the incoming request will contain a sampleExtraction property which will be set to true.
If the GDS report includes multiple widgets with different dimensions/metrics configurations, then GDS might fire multiple getData calls, one for each of them.
Kind of a late answer but this might help others who are facing the same problem.
The widgets / search filters attached to a graph issue getData calls of their own. If your custom adapter is built to retrieve data via API calls from third-party services, and that data is agnostic to the request.fields property sent by GDS, then these API calls are multiplied by N+1 (where N = the amount of widgets / search filters your report implements).
I could not find an official solution for this either, so I invented a workaround using cache.
The graph's getData request (typically requesting more fields than the search filters) will be the only one allowed to query the API endpoint. Before it starts doing so, it will store a key in the cache, "cache_{hashOfReportParameters}_building" => true.
if (enableCache) {
  cache.putString("cache_{hashOfReportParameters}_building", 'true');
  Logger.log("Cache is being built...");
}
It will retrieve API responses, paginating in a loop, and buffer the results.
Once it has finished, it will delete the cache key "cache_{hashOfReportParameters}_building" and cache the final merged results it buffered so far under "cache_{hashOfReportParameters}_final".
When it comes to filters, they also invoke getData, but typically with only up to 3 requested fields. The first thing we want to do is make sure they cannot start executing prior to the primary getData call... so we add a little bit of a delay for anything that might be a search filter / widget after the same data set:
if (enableCache) {
  var countRequestedFields = requestedFields.asArray().length;
  Logger.log("Total requested fields: " + countRequestedFields);
  if (countRequestedFields <= 3) {
    Logger.log('This seems to be a search filter.');
    Utilities.sleep(1000);
  }
}
After that we compute a hash on all of the moving parts of the report (date range, plus all of the other parameters you have set up that could influence the data retrieved from your API endpoints).
Now the best part: as long as the main graph is still building the cache, we make these getData calls wait:
while (cache.getString('cache_{hashOfReportParameters}_building') === 'true') {
  Logger.log('A similar request is already executing, please wait...');
  Utilities.sleep(2000);
}
After this loop we attempt to retrieve the contents of "cache_{hashOfReportParameters}_final" -- and in case we fail, it's always a good idea to have a backup plan, which would be to allow it to traverse the API again. We have encountered a ~2% error rate retrieving data we cached...
With the cached result (or buffered API responses), you just transform your response as per the schema GDS needs (which differs between graphs and filters).
As you start implementing this, you'll notice yet another problem... the Apps Script cache is limited to a max of 100KB per key. There is, however, no limit on the number of keys you can cache... and fortunately others have encountered similar needs in the past and have come up with a smart solution: splitting the big chunk you need cached into multiple cache keys and gluing them back together into one object when retrieving it.
See: https://github.com/lwbuck01/GASs/blob/b5885e34335d531e00f8d45be4205980d91d976a/EnhancedCacheService/EnhancedCache.gs
I cannot share the final solution we have implemented with you as it is too specific to a client - but I hope that this will at least give you a good idea on how to approach the problem.
Caching the full API result is a good idea in general to avoid unnecessary round trips and server load, if near-realtime data is good enough for your needs.

Storing data in FIWARE Object Storage

I'm building an application that stores files into the FIWARE Object Storage. I don't quite understand what the correct way of storing files into the storage is.
The Python code snippet below, taken from the Object Storage - User and Programmers Guide, shows 2 ways of doing it:
def store_text(token, auth, container_name, object_name, object_text):
    headers = {"X-Auth-Token": token}
    # 1. version
    #body = '{"mimetype":"text/plain", "metadata":{}, "value" : "' + object_text + '"}'
    # 2. version
    body = object_text
    url = auth + "/" + container_name + "/" + object_name
    return swift_request('PUT', url, headers, body)
The 1. version confuses me, because when I first looked at the only Node.js module (repo: fiware-object-storage) that works with Object Storage, it seemed to use the 1. version. Since the module was making calls to the old (v1.1) API version instead of the presumably newest (v2.0) referenced by the Python example, I'm not sure whether that is an outdated way of doing it or not.
As I played more with the module, I realised it didn't work and the code for it was a total mess. So I forked the project and quickly understood that I would need to rewrite it from the ground up, taking the above-mentioned Python example from the usage guide as a reference. Link to my repo.
As of writing this, the only methods that aren't implemented are object storing (PUT) and object fetching (GET).
I had some additional questions about the Object Storage which I sent to fiware-lab-help@lists.fiware.org, but haven't heard anything back, so I'm asking them here.
I haven't got much experience with writing API libraries. Do I need to worry about the auth token expiring? I presume it is not necessary to authenticate anew every time we interact with storage; the authentication should happen once when the server starts up (when we create an instance) and be kept internally. Should I implement some kind of mechanism that refreshes the token?
Does the tenant ID change? From the quote below I presume that getting a tenant is just a one-time deal, and later you can use it in the config to make fewer authentication calls.
A valid token is required to access an object store. This section
describes how to get a valid token assuming an identity management
system compatible with OpenStack Keystone is being used. If the
username, password and tenant details are known, only step 3 is
required. source
During authentication, when fetching tenants, how should I select the "right" one? For now I'm just taking the first one, similar to what the example code does.
Is it true that an object storage container belongs to only a single region?
Use only what you call version 2. Ignore your version 1. It is commented out in the example. It should be removed from the documentation.
(1) The token will be valid for some period of time. This could be an hour or a day, depending on the setup. This period should be specified in the token that is returned by the authentication service. The token needs to be periodically refreshed; a minimal refresh sketch follows this list.
(2) The tenant id does not change.
(3) Typically only one tenant id is returned. It is possible, however, that you were assigned more than one id, in which case you have to pick which one you are currently using. Containers typically belong to a single tenant and are not shared between tenants.
(4) Containers are typically limited to a single region. This may change in the future when multi-region support for a container is added to Swift.
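As a rough sketch of point (1), assuming a Keystone v2-compatible identity service as in the guide (the auth URL, credential names, and re-auth-on-expiry policy here are assumptions, not part of the original answer):
import requests

KEYSTONE_URL = "https://your-keystone-host:5000/v2.0"  # placeholder auth endpoint

def get_token(username, password, tenant_id):
    # Request a fresh token; the response also says when it expires
    resp = requests.post(KEYSTONE_URL + "/tokens", json={
        "auth": {
            "passwordCredentials": {"username": username, "password": password},
            "tenantId": tenant_id
        }
    })
    resp.raise_for_status()
    token = resp.json()["access"]["token"]
    return token["id"], token["expires"]

# Re-authenticate whenever the cached token is close to its "expires" time,
# instead of authenticating on every storage request.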
Solved my troubles and created the NPM module that works with the FIWARE Object Storage: https://github.com/renarsvilnis/fiware-object-storage-ge

How do I save geolocation data for an HTML5 or trigger.io app for path mapping later?

I'm creating an app that needs to track the location of the user (with their knowledge, like a running app) so that I can show them their route later. Should I use HTML5 with some timeout interval to save the coordinates every N seconds, and if so, how often should I save the data and how should I save it (locally using local storage, or post it to the server)?
Also, what is the easiest way to display the map of where the user has been later?
Has anyone done anything like this before?
The timeout interval for forge.geolocation is up to you and the balance of responsiveness of your application. Also, network traffic is expensive, so maybe you can buffer... say the last 10 geopositions... and then HTTP post them in bulk (or whatever... see Parse below)? And since the geo data sounds like temporary device data, why would there be a need to persist it using forge.prefs? Unless maybe you need the app to work "offline"?
For permanent storage I would look at Parse (generous free plan) and their Parse.GeoPoint class via their Javascript or REST Api as one possible solution. They have some nifty methods like (kilometersTo, milesTo, radiansTo) - https://parse.com/docs/js/symbols/Parse.GeoPoint.html

SOLR - Best approach to import 20 million documents from csv file

My current task at hand is to figure out the best approach to load millions of documents into Solr.
The data file is an export from the DB in CSV format.
Currently, I am thinking about splitting the file into smaller files and having a script post these smaller ones using curl.
I have noticed that if you post a high amount of data, most of the time the request times out.
I am looking into the data importer and it seems like a good option.
Any other ideas are highly appreciated.
Thanks
Unless a database is already part of your solution, I wouldn't add that additional complexity. Quoting the SOLR FAQ, it's your servlet container that is issuing the session time-out.
As I see it, you have a couple of options (In my order of preference):
Increase container timeout
Increase the container timeout. ("maxIdleTime" parameter, if you're using the embedded Jetty instance).
I'm assuming you only occasionally index such large files? Increasing the time-out temporarily might just be the simplest option.
Split the file
Here's a simple unix script that will do the job (splitting the file into 500,000-line chunks):
split -d -l 500000 data.csv split_files.
for file in `ls split_files.*`
do
  curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @$file
done
Parse the file and load in chunks
The following groovy script uses opencsv and solrj to parse the CSV file and commit changes to Solr every 500,000 lines.
import au.com.bytecode.opencsv.CSVReader
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
    @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0'),
])
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader)
    String[] result
    Integer count = 1
    Integer chunkSize = 500000

    while ((result = csv.readNext()) != null) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", result[0])
        doc.addField("name_s", result[1])
        doc.addField("category_s", result[2])
        server.add(doc)

        // Commit every chunkSize documents
        if (count.mod(chunkSize) == 0) {
            server.commit()
        }
        count++
    }

    // Final commit for any remaining documents
    server.commit()
}
In SOLR 4.0 (currently in BETA), CSVs from a local directory can be imported directly using the UpdateHandler. Modifying the example from the SOLR Wiki:
curl 'http://localhost:8983/solr/update?stream.file=exampledocs/books.csv&stream.contentType=text/csv;charset=utf-8'
This streams the file from its local location, so there is no need to chunk it up and POST it via HTTP.
The above answers have explained the single-machine ingestion strategies really well.
Here are a few more options if you have big data infrastructure in place and want to implement a distributed data ingestion pipeline.
Use Sqoop to bring the data into Hadoop, or place your CSV file in Hadoop manually.
Then use one of the below connectors to ingest the data:
the Hive-Solr connector or the Spark-Solr connector.
PS:
Make sure no firewall blocks connectivity between the client nodes and the Solr/SolrCloud nodes.
Choose the right directory factory for data ingestion; if near-real-time search is not required, then use StandardDirectoryFactory.
If you get the below exception in the client logs during ingestion, then tune the autoCommit and autoSoftCommit configuration in the solrconfig.xml file:
SolrServerException: No live SolrServers available to handle this request
Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, Postgres' COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster since you won't have unnecessary network/HTTP overhead.
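For the first step, a minimal sketch of the COPY part in Python, assuming psycopg2 and a hypothetical documents(id, name, category) table (none of these names come from the question):
# Illustrative only: bulk-load the CSV into Postgres, then let Solr's
# Data Import Handler pull the rows from the database.
import psycopg2

conn = psycopg2.connect("dbname=solr_staging")  # placeholder connection string
with conn, conn.cursor() as cur:
    with open("data.csv") as f:
        cur.copy_expert("COPY documents (id, name, category) FROM STDIN WITH CSV", f)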
The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates.
Javadocs are somewhat incorrect (v 3.6.2, v 4.7.0):
ConcurrentUpdateSolrServer buffers all added documents and writes them into open HTTP connections.
It doesn't buffer indefinitely, but up to int queueSize, which is a constructor parameter.