Hi, I am not sure if I am heading towards the right solution and need some advice.
I have some social media platform connectors that dump files in CSV format. For instance, a CSV file has impressions, reach, and clicks as columns. What I am trying to achieve is a data pipeline in Google Cloud Platform that loads only the impressions and clicks columns from those CSV files into a MySQL table (which I manage through MySQL Workbench).
Is this possible? If not, what are the recommendations? I could use BigQuery for this, but we only want to work with a subset of the CSV data, not all of it.
Suggestions please!
For this there is Dataprep, an intelligent cloud data service to visually explore, clean, and prepare data.
There you can:
1. feed in all your CSV files from Cloud Storage
2. explore them visually
3. set up recipes for cleaning, transforming, filtering, or joining datasets
4. define jobs to write the combined results to a final CSV or to BigQuery, for example
Given the power and the price of BigQuery, I would use it. You can either load your files into BigQuery staging/temp tables, or leave them in Cloud Storage and scan them with a federated (external) table.
The principle is to query the files, keep only the columns/data that you want, optionally transform/filter/clean them, and store them in your final table:
CREATE TABLE XXX AS
SELECT ... FROM <staging table / external table>;
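If you want to script it, a rough sketch with the BigQuery Python client could look like this (the bucket, project, dataset and table names are placeholders, and I'm assuming your CSVs have a header row):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Describe the CSVs sitting in Cloud Storage as an external (federated) table.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/social/*.csv"]  # placeholder bucket/path
    external_config.autodetect = True
    external_config.options.skip_leading_rows = 1  # skip the header row

    # Query only the columns you care about and write them to a final table.
    job_config = bigquery.QueryJobConfig(
        table_definitions={"social_csv": external_config},
        destination="my-project.marketing.impressions_clicks",  # placeholder table
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query("SELECT impressions, clicks FROM social_csv", job_config=job_config).result()

From there you could stop at the BigQuery table, or copy that (much smaller) result into Cloud SQL for MySQL if you really need it in MySQL.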
I developed a web app on a MySQL database and now I am switching to Android mobile development, but I have a large amount of data to be exported into Firebase's Cloud Firestore. I could not find a way to do so; I have the MySQL data stored in JSON and CSV.
Do I have to write a script? If yes, can you share the script, or is there some sort of tool?
I have large amounts of data to be exported into Firebase's Cloud Firestore, I could not find a way to do so
If you're looking for a "magic" button that can convert your data from a MySQL database to a Cloud Firestore database, please note that there isn't one.
Do I have to write a script?
Yes, you have to write code in order to convert your actual MySQL database into a Cloud Firestore database. Please note that the two types of databases are based on quite different concepts. For instance, a Cloud Firestore database is composed of collections and documents; there are no tables in the NoSQL world.
So, I suggest you read the official documentation regarding Get started with Cloud Firestore.
If yes then can you share the script or is there some sort of tool.
There is no such script and no tool for that; you will have to create your own mechanism.
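Just to illustrate what such a script might look like, here is a minimal sketch (not an official tool) that loads an exported CSV into a Firestore collection. The "users" collection name and the CSV layout are assumptions, so adapt them to your export:

    import csv
    from google.cloud import firestore

    db = firestore.Client()

    # Fine for a one-off migration of moderate size; read all exported rows.
    with open("users.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Write in chunks of 500, the maximum size of a Firestore write batch.
    for start in range(0, len(rows), 500):
        batch = db.batch()
        for row in rows[start:start + 500]:
            # Each CSV row becomes one document; there are no tables in Firestore.
            batch.set(db.collection("users").document(), row)
        batch.commit()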
I wanted to be able to store BigQuery results as JSON files in Google Cloud Storage. I could not find an out-of-the-box way of doing this, so what I had to do was:
1. Run the query against BigQuery and store the results in a permanent table. I use a random GUID to name the permanent table.
2. Read the data from BigQuery, convert it to JSON in my server-side code, and upload the JSON data to GCS.
3. Delete the permanent table.
4. Return the URL of the JSON file in GCS to the front-end application.
While this works, there are some issues with it.
A. I do not believe I am making use of BigQuery's caching by using my own permanent tables. Can someone confirm this?
B. Step 2 will be a performance bottleneck. Pulling data out of GCP just to convert it to JSON and re-upload it into GCP feels wrong. A better approach would be some cloud-native serverless function or another GCP data-workflow service that performs this step, triggered upon creation of a new table in the dataset. What do you think is the best way to achieve this step?
C. Is there really no way to do this without using permanent tables?
Any help appreciated. Thanks.
With a persistent table, you are able to leverage BigQuery's data export feature to export the table in JSON format to GCS. Export jobs are free, compared with reading the table from your server-side code.
There is actually a way to avoid creating a permanent table, because every query result is already a temporary table. If you go to "Job Information" you can find the full name of that temp table, which can be passed to a data export job to be exported as JSON to GCS. However, this is way more complicated than just creating a persistent table and deleting it afterwards.
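For example, with the Python client you can grab the temporary results table straight from the query job and hand it to an extract job (the query and the bucket name are placeholders; as noted above, the anonymous temp table needs to be addressed by its full name):

    from google.cloud import bigquery

    client = bigquery.Client()

    query_job = client.query("SELECT name, value FROM `my-project.my_dataset.my_table`")  # placeholder query
    query_job.result()  # wait for the query to finish

    # Every query result is already a temporary table; its reference is on the job.
    temp_table = query_job.destination

    # Export that table as newline-delimited JSON files to Cloud Storage.
    extract_config = bigquery.ExtractJobConfig()
    extract_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    client.extract_table(
        temp_table,
        "gs://my-bucket/results/result-*.json",  # placeholder bucket
        job_config=extract_config,
    ).result()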
I have searched high and low, but it seems like mysqldump and "SELECT ... INTO OUTFILE" are both intentionally blocked by not granting file permissions to the database admin. Wouldn't it save a lot more server resources to allow file permissions than to disallow them? Every other import/export method I can find executes much more slowly, especially with tables that have millions of rows. Does anyone know a better way? I find it hard to believe Azure left no good way to do this common task.
You did not list the other options you found to be slow, but have you thought about using Azure Data Factory:
Use Data Factory, a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.
It supports exporting data from Azure MySQL and MySQL:
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are supported as sources/sinks by the copy activity, see Supported data stores and formats
Azure Data Factory allows you to define mappings (optional!) and/or transform the data as needed. It has a pay-per-use pricing model.
You can start an export manually or on a schedule using the .NET or Python SDK, the REST API, or PowerShell.
It seems you are looking to export the data to a file, so Azure Blob Storage or Azure Files is likely to be a good destination. FTP or the local file system are also possible.
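As a rough idea of what starting an export from the Python SDK looks like, you can trigger an existing copy pipeline like this (the resource group, factory and pipeline names are placeholders, and the pipeline that copies the MySQL table to Blob Storage is assumed to already exist):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Authenticate and point the client at your subscription (placeholder ID).
    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

    # Kick off the pipeline that copies the Azure MySQL table to Blob Storage.
    run = adf_client.pipelines.create_run(
        resource_group_name="my-resource-group",
        factory_name="my-data-factory",
        pipeline_name="ExportMySqlToBlob",
    )
    print("Started pipeline run:", run.run_id)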
"SELECT INTO ... OUTFILE" we can achieve this using mysqlworkbench
1.Select the table
2.Table Data export wizard
3.export the data in the form of csv or Json
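If you'd rather script it than click through the wizard, the same client-side export can be done from Python. This is just a sketch (connection details and the table name are placeholders), and for millions of rows it will still be slower than a server-side OUTFILE would have been:

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string for an Azure Database for MySQL server.
    engine = create_engine("mysql+pymysql://user:password@myserver.mysql.database.azure.com/mydb")

    # Stream the table in chunks so millions of rows never have to fit in memory at once.
    first_chunk = True
    for chunk in pd.read_sql("SELECT * FROM my_big_table", engine, chunksize=100_000):
        chunk.to_csv("my_big_table.csv", mode="w" if first_chunk else "a",
                     header=first_chunk, index=False)
        first_chunk = False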
I'm in a traditional back-to-front ETL stack, from a data source (Adobe Analytics) to a MySQL data warehouse to a Tableau front end for visualization. My question revolves around best practices for cleaning and mapping data, and at which step to do each.
1) Cleaning: We have no automated connector (SSIS, etc.) from the source (Adobe) to the data warehouse, so we're left with periodic uploads of CSV files. For various reasons these files become less than optimal (misspellings, nulls, etc.). Question: should the cleaning be done on the CSV files, or once the data is uploaded into the MySQL data warehouse (in tables/views)?
2) Mapping: a number of different end-user use cases require us to map the data to tables (geographic regions, types of accounts, etc.). Should this be done in the data warehouse (MySQL joins), or is it just as good in the front end (Tableau)? The real question pertains to performance, I believe, since you could do it relatively easily at either step.
Thanks!
1) Cleaning: I'd advise you to load the data from the CSV files into a staging database and clean it there, before it reaches the database that Tableau connects to. This way you keep the original files, which you can reload later if necessary. I'm not sure what a "traditional back-to-front ETL stack" is, but an ETL tool like Microsoft SSIS or Pentaho Data Integration (free) will be a valuable help in building these processes, and you can then run your ETL jobs periodically or every time a new file lands in the directory. Here is a good example of such a process: https://learn.microsoft.com/en-us/sql/2014/integration-services/lesson-1-create-a-project-and-basic-package-with-ssis
2) Mapping: You should have a data model, probably a dimensional model, built on the database that Tableau connects to. This data model should store clean, "business-modelled" data. You should perform the lookups (joins/mappings) when you are transforming your data, so you can load it into the data model. Having Tableau explore a dimensional model of clean data will also be better for UX and performance.
The overall flow would look something like: CSV -> Staging database -> Clean/Transform/Map -> Business data model (database) -> Tableau
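If you don't want a full ETL tool right away, even a small script can implement that flow. A minimal sketch in Python (the connection string, table names and cleaning rules are all assumptions):

    import pandas as pd
    from sqlalchemy import create_engine, text

    # Placeholder connection string for the MySQL warehouse.
    engine = create_engine("mysql+pymysql://user:password@localhost/warehouse")

    # 1. Land the raw CSV untouched in a staging table (and keep the original file).
    raw = pd.read_csv("adobe_export.csv")
    raw.to_sql("stg_adobe_export", engine, if_exists="replace", index=False)

    # 2. Clean/transform while loading the business model that Tableau will query.
    with engine.begin() as conn:
        conn.execute(text("""
            INSERT INTO fact_web_traffic (visit_date, region, visits)
            SELECT visit_date,
                   COALESCE(NULLIF(TRIM(region), ''), 'UNKNOWN'),  -- handle blanks here
                   visits
            FROM stg_adobe_export
            WHERE visits IS NOT NULL
        """))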
We are finally moving from Excel and .csv files to databases. Currently, most of my Tableau files are connected to large .csv files (.twbx).
Are there any performance differences between PostgreSQL and MySQL in Tableau? Which would you choose if you were starting from scratch?
Right now, I am using pandas to join files together and create a new .csv file based on the join. (For example, I take a 10-million-row file, drop duplicates, and create a primary key, then I join it on the same key with a 5-million-row file, then I export the new 'consolidated' file to .csv and connect Tableau to it. Sometimes the joins are complicated, involving dates or times and several columns.)
I assume I can create a view in a database and then connect to that view rather than creating a separate file, correct? Each of my files could instead be a separate table, which should save space and allow me to query by date rather than reading the whole file into memory with pandas.
Some of the people using the RDBMS would be completely new to databases in general (dashboards here are just Excel files, no normalization, formulas in the raw data sheet, etc.; it's a mess), so hopefully either choice has good documentation to lessen the learning curve (mainly inserting new data and selecting data, not actual database design).
Both will work fine with Tableau. In fact, Tableau Server's internal repository database is PostgreSQL.
Between the two, I think Postgres is more suitable for a central data warehouse. MySQL (prior to version 8.0) doesn't support certain SQL features such as Common Table Expressions and window functions.
Also, if you’re already using Pandas, Postgres has a built-in Python extension called PL/Python.
However, if you’re looking to store a small amount of data and get to it really fast without using advanced SQL, MySQL would be a fine choice but Postgres will give you a few more options moving forward.
As stated, either database will work and Tableau is basically agnostic to the type of database that you use. Check out https://www.tableau.com/products/techspecs for a full list of all native (inbuilt & optimized) connections that Tableau Server and Desktop offer. But, if your database isn't on that list you can always connect over ODBC.
Personally, I prefer Postgres over MySQL (I find it really easy to use psycopg2 to write to Postgres from Python), but your mileage may vary.
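For instance, the pandas join described in the question could live in the database as a view that Tableau connects to directly. A rough sketch with psycopg2 (connection details, table and column names are made up):

    import psycopg2

    # Placeholder connection details.
    conn = psycopg2.connect("dbname=analytics user=etl password=secret host=localhost")

    with conn, conn.cursor() as cur:
        # The 10M-row and 5M-row files become plain tables; the join becomes a view,
        # so there is no consolidated CSV to regenerate.
        cur.execute("""
            CREATE OR REPLACE VIEW consolidated AS
            SELECT b.big_key, b.event_date, b.metric, s.account_type, s.region
            FROM big_table b
            JOIN small_table s ON s.big_key = b.big_key;
        """)
    conn.close()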