PySpark Partitioning with Overlap

I am trying to partition my data to send to multiple machines for PySpark to run on at the same time, but then some of the data I want to send to one machine I also want to send to a different machine. How would I partition the data with overlaps?

rdd.randomSplit([1]*N) returns a list of N roughly equal-sized RDDs; I guess you can replicate the items you want to overlap before splitting them.
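To make the overlap explicit, here is a minimal PySpark sketch of that replication idea (the partition count N and the assign_partitions() rule are made up for illustration): each record is emitted once per partition it should land in, then the pairs are partitioned by that key.

# Sketch: replicate records into every partition that should receive them.
# assign_partitions() is a hypothetical rule returning the list of partition
# ids a record belongs to; here each record also goes to the "next" partition.
N = 4

def assign_partitions(record):
    base = hash(record) % N
    return [base, (base + 1) % N]

pairs = rdd.flatMap(lambda rec: [(p, rec) for p in assign_partitions(rec)])
overlapped = pairs.partitionBy(N, lambda p: p)

Each worker then processes one partition (e.g. with mapPartitions), and a record that belongs to two partitions is simply present in both copies.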

Related

Sync data from multiple local mysql instances to one cloud database

I am looking for a solution to sync data from multiple small instances to one big cloud instance.
I have many devices gathering data logs; every device has its own database, so I need a solution to sync data from all of them to one instance. Latency is not critical, but I want the data synced with a maximum delay of 5-10 minutes.
Is there any ready solution for it?
Assuming all the data is independent, INSERT all the data into a single table. That table would, of course, have a device_id column to distinguish where the numbers are coming from.
What is the total number of rows per second you need to handle? If it is less than 1000/second, there should be no problem inserting the rows into the same table as they arrive.
Are you using HTTP? Or something else to do the INSERTs? PHP? Java?
With this, you will rarely see more than a 1 second delay between the reading being taken and the table having the value.
I recommend
PRIMARY KEY(device_id, datetime)
And the use of Summary tables rather than slogging through that big Fact table to do graphs and reports.
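For concreteness, a minimal sketch of that layout in Python (mysql-connector is just an example client; the table, column and credential values are all illustrative):

import mysql.connector

conn = mysql.connector.connect(host="cloud-host", user="app", password="secret", database="telemetry")
cur = conn.cursor()

# One Fact table for all devices, keyed by (device_id, datetime).
cur.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        device_id INT NOT NULL,
        datetime  DATETIME NOT NULL,
        value     DOUBLE,
        PRIMARY KEY (device_id, datetime)
    )
""")

# Rows from every device go straight into the same table as they arrive.
cur.execute(
    "INSERT INTO readings (device_id, datetime, value) VALUES (%s, %s, %s)",
    (42, "2024-01-01 12:00:00", 3.14),
)
conn.commit()
conn.close()

A Summary table (e.g. hourly totals per device), refreshed periodically, would then feed the graphs and reports instead of the big Fact table.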
Provide more details if you would like further advice.

Real Time data processing without delay

We have a problem to solve. We are running our server on AWS and the database is MySQL Amazon Aurora. Our data comes into one table via an API from mobile devices and goes to another table via AWS pipelines. The problem is that we want to apply some aggregation on real-time data while inserting into the second table. Remember, data is constantly arriving in the first table.
Example:
Problem:
We want to apply aggregation (mostly SUM operations on the clicks, impressions columns, etc.) on real-time data and store the results in the second table so we can display summarized data to our users without much delay. Please ask if you don't understand the problem.
What we want:
How can we apply aggregation and process real-time data while data is arriving constantly? We want to know the best approach to solving this problem.

Sequence of mysql queries in Spark

I have a requirement in Spark where I need to fetch data from mysql instances and after some processing enrich them with some more data from a different mysql database.
However, when I try to access the database again from inside a map function, I get a
org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
My code looks like this:
val reader = sqlContext.read;
initialDataset.map( r => reader.jdbc(jdbcUrl, s"(select enrichment_data from other_table where id='${r.getString(1)}') result", connectionProperties).rdd.first().get(0).toString )
Any ideas / pointers? Should I use two different Datasets? Thanks!
First of all, the map() function takes a row from the existing Dataset, applies the changes you made and returns the updated row. This is why you get the exception: Spark cannot serialize the closure containing reader.jdbc(jdbcUrl, ...).
To solve your issue you have multiple options according to your needs:
You could broadcast one of these datasets after collecting it. With broadcast, the dataset is stored in each node's memory, so this works if the dataset is small enough to fit there. You could then just look it up and combine the results with the 2nd dataset.
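The question's code is Scala, but the broadcast idea looks roughly the same in PySpark; here is a sketch (initial_rdd, jdbcUrl and connectionProperties mirror the names in the question, and the id/enrichment_data columns come from its query):

# Sketch: collect the small enrichment dataset once on the driver,
# broadcast it, and look values up inside the transformation.
small_df = spark.read.jdbc(jdbcUrl, "other_table", properties=connectionProperties)
lookup = {row["id"]: row["enrichment_data"] for row in small_df.collect()}
bcast = spark.sparkContext.broadcast(lookup)

enriched = initial_rdd.map(lambda r: (r, bcast.value.get(r[1])))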
If both datasets are big and not suitable for loading into node memory, then use mapPartitions; you can find more information about mapPartitions here. mapPartitions is called once per partition instead of once per element as map() is. If you choose this option you can access the 2nd dataset from mapPartitions, or even initialize the data you need there (e.g. retrieve all related records from the 2nd database).
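Again sketched in PySpark terms, the mapPartitions variant opens one database connection per partition instead of one per row (mysql-connector and the exact query are assumptions here):

import mysql.connector

def enrich_partition(rows):
    # One connection for the whole partition instead of one per element.
    conn = mysql.connector.connect(host="db-host", user="app", password="secret", database="other_db")
    cur = conn.cursor()
    for r in rows:
        cur.execute("SELECT enrichment_data FROM other_table WHERE id = %s", (r[1],))
        hit = cur.fetchone()
        yield (r, hit[0] if hit else None)
    conn.close()

enriched = initial_rdd.mapPartitions(enrich_partition)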
Please be aware that I assumed these two datasets have some kind of dependency (e.g. you need to access some value from the 2nd database before executing the next step). If they don't, then just create both ds1 and ds2 and use them normally as you would with any dataset. Finally, remember to cache the datasets if you know you will need to access them multiple times.
Good luck

Large JSON Storage

Summary
What is the "best practice" way to store large JSON arrays on a remote web service?
Background
I've got a service, "service A", that generates JSON objects ("items") no larger than 1 KiB each. Every time it emits an item, the item needs to be appended to a JSON array. Later, a user can fetch all of these arrays of items, which can be tens of MiB or more.
Performance
What is the best way to store the JSON to make appending and retrieval performant? Ideally, insertion would be O(1) and retrieval would be fast enough that we wouldn't need to tell the user to wait until their files have downloaded.
The downloads have never become so large that the constraint is the time to download them from the server (if they were a 10 MiB file). The constraint has always been the time to compute the file.
Stack
Our current stack is running Django + Postgresql on Elasticbeanstalk. New services are acceptable (e.g. S3 if append were supported).
Attempted Solutions
When we try to store all JSON in a single row in the database, performance is understandably slow.
When we try to store each JSON object in a separate row, it takes too long to aggregate the separate rows into a single array of items. In addition, a user requests all item arrays in their account every time they visit the main screen of the app, so it is inefficient to recompute the aggregated array of items each time.
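For reference, the per-row variant we tried looks roughly like this (a sketch using psycopg2; the items table and the account_id/payload columns are made-up names). The json_agg step on retrieval is the part that becomes too slow:

import json
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# Append path: each emitted item is one row, which is effectively O(1).
item = {"event": "click", "ts": "2024-01-01T12:00:00Z"}
cur.execute("INSERT INTO items (account_id, payload) VALUES (%s, %s)", (7, json.dumps(item)))
conn.commit()

# Retrieval path: re-aggregating every row into one array on each request.
cur.execute("SELECT json_agg(payload::json) FROM items WHERE account_id = %s", (7,))
items_array = cur.fetchone()[0]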

Analyzing multiple Json with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationship, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this ? Should I merge all the JSON first, or load them in a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guideline to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
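For the Spark/Parquet route, the preprocessing step can be as small as the following sketch (the file paths and the timestamp/page columns are assumptions about the log schema):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-prep").getOrCreate()

# Read a month of daily JSON logs, aggregate to something much smaller,
# and write columnar Parquet that Tableau (or an extract) can consume efficiently.
logs = spark.read.json("/logs/2017-*.json")
daily = (logs
         .withColumn("day", F.to_date("timestamp"))
         .groupBy("day", "page")
         .count())
daily.write.mode("overwrite").partitionBy("day").parquet("/warehouse/site_logs_daily")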
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Additive functions like SUM(), MIN() and MAX() are safe: sums of partial sums are still correct sums. But averages of averages and count distincts of count distincts often are not; for example, the average of two daily averages only equals the overall average when both days contain the same number of rows.
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related stack overflow answer
For text files and extracts, Tableau loads the data into memory via its Data Engine process today -- to be replaced by a new in-memory database called Hyper in the future. The concept is the same, though: Tableau sends the data source a query which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volume of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer nobody gave is: no, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First, extract the columns from the JSON data as below:
select
get_json_object(JSON_column, '$.Attribute1') as Attribute1,
get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for each required table and then join them in Tableau.