I have written two functions in Firebase that maintain data, e.g. deleting old data daily.
My question is: when I run a query to get data, does it count toward the GB-downloaded limit, which is $1/GB on the Blaze plan?
Since the data is transferred from Firebase's servers (Google's servers) to a user's computer (you, in this case), you will be charged for all of that data transferred to your computer.
Can we use the Ethereum network just like a database to store data? What issues might occur if it is used as a database?
Yes, it's possible. Just write a smart contract to store and retrieve your data.
Google the term "Solidity CRUD" for articles and tutorials on storing data on Ethereum.
The downsides are:
Speed - Blockchains are slow to write to and not particularly fast to read from. Ethereum will never be able to compete with even low-performance databases like SQLite, much less go up against Postgres, Oracle, or MongoDB.
Cost - Reading from Ethereum is free, but writes cost Ether. The exact cost depends on the size of the data you want to store. For small amounts of data this does not matter much. For services, you can even make writes part of the API that your users pay for (such as buying a ticket from you), so it doesn't cost you anything. But if you have gigabytes of legacy data, migrating it to the blockchain can be very expensive.
On top of that, a large data transfer to the blockchain will cause demand for transactions to spike, which increases the cost per transaction. This is not just theoretical; it has happened before: when the CryptoKitties smart contract launched, the game suddenly became so popular that transactions went from less than one cent each to tens of US dollars each.
In general, you'd want to store on Ethereum only the core data that needs to be secure, and link it to other data sources (for example, store a URL and a hash of the object on-chain, but store the object itself on Amazon S3 or Azure Storage).
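The link-plus-hash pattern above can be sketched in a few lines of Python using only the standard library (the on-chain write itself is omitted; assume the hex digest is what you would record in your contract next to the storage URL):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest you would anchor on-chain."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, onchain_hash: str) -> bool:
    """Check an object fetched from off-chain storage (e.g. S3)
    against the hash recorded on the blockchain."""
    return fingerprint(data) == onchain_hash

doc = b"ticket #1234, seat 12A"      # hypothetical object stored on S3
anchored = fingerprint(doc)          # stored on Ethereum next to the S3 URL
assert verify(doc, anchored)         # object fetched later is untampered
assert not verify(b"tampered", anchored)
```

This way only 32 bytes plus a URL go on-chain, keeping gas costs constant regardless of object size.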
I want to build a machine learning system (a Python program) with a large amount of historical trading data.
The trading company has an API to grab their historical and real-time data. The data volume is about 100 GB for historical data and about 200 MB per day.
The trading data is typical time-series data: price, name, region, timestamp, etc. It can be retrieved as large files or stored in a relational DB.
So my question is: what is the best way to store this data on AWS, and what's the best way to add new data every day (through a cron job or an ETL job)? Possible solutions include storing it in a relational database, in NoSQL databases like DynamoDB or Redis, or in a file system to be read by the Python program directly. I just need a way to persist the data in AWS so multiple teams can grab it for research.
Also, since this is a research project, I don't want to spend too much time exploring new systems or emerging technologies. I know there are time-series databases like InfluxDB or the new Amazon Timestream, but considering the learning curve and deadline, I'm not inclined to learn and use them for now.
I'm familiar with MySQL. If really needed, I can pick up a NoSQL store like Redis/DynamoDB.
Any advice? Many thanks!
If you want to use AWS EMR, then the simplest solution is probably just to run a daily job that dumps data into a file in S3. However, if you want to use something a little more SQL-ey, you could load everything into Redshift.
If your goal is to make it available in some form to other people, then you should definitely put the data in S3. AWS has ETL and data migration tools that can move data from S3 to a variety of destinations, so the other people will not be restricted in their use of the data just because of it being stored in S3.
On top of that, S3 is the cheapest (warm) storage option available in AWS, and for all practical purposes, its throughput is unlimited. If you store the data in a SQL database, you significantly limit the rate at which the data can be retrieved. If you store the data in a NoSQL database, you may be able to support more traffic (maybe), but at significant cost.
Just to further illustrate my point, I recently did an experiment to test certain properties of one of the S3 APIs, and part of my experiment involved uploading ~100GB of data to S3 from an EC2 instance. I was able to upload all of that data in just a few minutes, and it cost next to nothing.
The only thing you need to decide is the format of your data files. You should talk to some of the other people and find out whether JSON, CSV, or something else is preferred.
As for adding new data, I would set up a lambda function that is triggered by a CloudWatch event. The lambda function can get the data from your data source and put it into S3. The CloudWatch event trigger is cron based, so it’s easy enough to switch between hourly, daily, or whatever frequency meets your needs.
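A minimal sketch of such a Lambda handler, assuming boto3 (preinstalled in the Lambda Python runtime), a hypothetical bucket name, and a placeholder `fetch_daily_data()` standing in for the trading company's API:

```python
import csv
import io
import datetime

BUCKET = "my-research-data"  # hypothetical bucket name

def fetch_daily_data():
    """Placeholder for the call to the trading company's API."""
    return [{"name": "ABC", "region": "US", "price": 101.5}]

def build_csv(rows):
    """Serialize the day's rows to CSV for S3."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "region", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def handler(event, context):
    # boto3 is available by default in the AWS Lambda Python runtime
    import boto3
    key = f"daily/{datetime.date.today().isoformat()}.csv"
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=build_csv(fetch_daily_data())
    )
    return {"uploaded": key}
```

Wire `handler` to a scheduled (cron) CloudWatch Events rule and each run lands one dated CSV object under `daily/` in the bucket.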
I'm building a cloud sync application which syncs a user's data across multiple devices. I am at a crossroads, deciding whether to store the data on the server as files or in a relational database. I am using Amazon Web Services and will use S3 for user files, or their database service if I choose to store the data in a table instead. The data I'm storing is the state of the application every ten seconds. Storing this in a database could be problematic: the average user would generate about 100,000 rows, and with my current user base of 20,000 people that's 2 billion rows right off the bat. Would I be better off storing that information in files? That would be about 100 files totaling 6 megabytes per user.
As discussed in the comments, I would store these as files.
S3 is perfectly suited to be a key/value store and if you're able to diff the changes and ensure that you aren't unnecessarily duplicating loads of data, the sync will be far easier to do by downloading the relevant files from S3 and syncing them client side.
You get a big cost saving by not having to operate a database server that can store tonnes of rows and stay up to serve them to clients quickly.
My only real concern would be that the data in these files can be difficult to parse if you wanted to aggregate stats/data/info across multiple users as a backend or administrative view. You wouldn't be able to write simple SQL queries to sum up values etc, and would have to open the relevant files, process them with something like awk or regular expressions etc, and then compute the values that way.
You're likely doing that on the client side anyway for the specific files that relate to that user, though, so there's probably some overlap there!
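A sketch of the client-side diffing idea, assuming single-part uploads so the S3 ETag equals the object's MD5 digest (note this does not hold for multipart uploads); only files whose content changed get re-uploaded:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def files_to_sync(local_files: dict, remote_etags: dict) -> list:
    """local_files: name -> bytes; remote_etags: name -> ETag (MD5 hex).
    Return the names whose content differs from (or is missing on) S3,
    i.e. the only files that actually need uploading."""
    return [
        name for name, data in local_files.items()
        if remote_etags.get(name) != md5_hex(data)
    ]

local = {"state-001.json": b'{"t": 10}', "state-002.json": b'{"t": 20}'}
remote = {"state-001.json": md5_hex(b'{"t": 10}')}  # 002 not uploaded yet
print(files_to_sync(local, remote))  # → ['state-002.json']
```

In practice `remote_etags` would come from a boto3 `list_objects_v2` call over the user's key prefix.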
I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries — so long, in fact, that our daemon often intervenes to kill the queries.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this - is there some better way of storing and querying time series data using the Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand their query language that way).
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
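To give a flavour of that HTTP JSON API, here is a sketch in Python using only the standard library (the index name `events` is hypothetical, and `fixed_interval` is the Elasticsearch 7+ spelling; older clusters used `interval`):

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # or your hosted cluster's endpoint

def hourly_counts_query(field="@timestamp"):
    """Build a date_histogram aggregation: one bucket per hour."""
    return {
        "size": 0,  # we only want the buckets, not the raw events
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": field, "fixed_interval": "1h"}
            }
        },
    }

def search(index, query):
    """POST the query to the _search endpoint and decode the response."""
    req = urllib.request.Request(
        f"{ES_URL}/{index}/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# buckets = search("events", hourly_counts_query())["aggregations"]["per_hour"]["buckets"]
```

The same query body is what Kibana builds under the hood for its time-series charts.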
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used Elasticsearch for storing events and profiling information from a web site. I even wrote a statsd backend that stores stat information in Elasticsearch.
After Elasticsearch moved Kibana from version 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric per query, so if you want to chart time, average time, and 90th-percentile average time, you must run 3 queries instead of 1 that returns 3 values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a Time Series Database that is supported by Grafana - a time series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events similar to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/
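For reference, emitting a stat to statsd is just a one-line UDP datagram in the `name:value|type` wire format; here is a minimal Python sketch (the metric names are made up, and 8125 is statsd's default port):

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # statsd's default UDP port

def send_metric(name: str, value, metric_type: str = "c") -> str:
    """Emit one statsd datagram: 'name:value|type'
    ('c' = counter, 'ms' = timing, 'g' = gauge)."""
    payload = f"{name}:{value}|{metric_type}"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode(), STATSD_ADDR)
    sock.close()
    return payload  # returned for illustration/testing only

send_metric("web.page_views", 1)             # counter
send_metric("web.response_time", 320, "ms")  # timing
```

Because it's fire-and-forget UDP, instrumenting your app this way adds effectively no latency even if the statsd daemon is down.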
We're building a system that migrates documents from a different data store into Drive. We'll be doing this for different clients on a regular basis. Therefore, we're interested in performance: it impacts our customers' experience as well as our time to market, in that we need to do testing, and waiting for files to load prolongs each testing cycle.
We have 3 areas of Drive interaction:
1. Create folders (there are many, potentially 30,000+)
2. Upload files (similar in magnitude to the number of folders)
3. Recursively delete a file structure
In both cases 1 and 2, we run into "User rate limit exceeded" errors with just 2 and 3 threads, respectively. We have an exponential backoff policy, as suggested, that starts at 1 second and retries 8 times. We're setting quotaUser on all requests to a random UUID in an attempt to indicate to the server that we don't require user-specific rate limiting, but this seems to have had no impact compared to when we didn't set quotaUser.
Number 3 currently uses batch queries. 1 and 2 currently use "normal" requests.
I'm looking for guidance on how best to improve the performance of this system.
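For concreteness, the exponential-backoff policy described above looks roughly like this generic Python sketch (`RateLimitError` stands in for Drive's "User rate limit exceeded" HTTP 403, and the commented usage line is hypothetical):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for Drive's 'User rate limit exceeded' (HTTP 403) error."""

def with_backoff(call, retries=8, base=1.0):
    """Retry `call` on rate-limit errors, sleeping roughly
    base * 2**attempt seconds (with jitter) between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base * (2 ** attempt + random.random()))

# usage (hypothetical Drive client call):
# with_backoff(lambda: drive.files().create(body=meta).execute())
```

The random jitter matters with multiple threads: without it, all threads retry in lockstep and hit the quota wall together again.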