I have a requirement in Spark where I need to fetch data from MySQL instances and, after some processing, enrich it with more data from a different MySQL database.
However, when I try to access the database again from inside a map function, I get a
org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
My code looks like this:
val reader = sqlContext.read
initialDataset.map( r => reader.jdbc(jdbcUrl, s"(select enrichment_data from other_table where id='${r.getString(1)}') result", connectionProperties).rdd.first().get(0).toString )
Any ideas / pointers? Should I use two different Datasets? Thanks!
First of all, the function you pass to map() is applied to each row of the existing Dataset, so Spark must serialize that closure and ship it to the executors. This is why you get the exception: the closure containing reader.jdbc(jdbcUrl, ...) can't be serialized.
To solve your issue you have multiple options according to your needs:
You could broadcast one of these datasets after collecting it. With a broadcast, the dataset is stored in your nodes' memory, so this works if it is small enough to fit there. You could then just look values up in it and combine the results with the 2nd dataset.
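For example, a minimal sketch of the broadcast approach (the table and column names are taken from your query; sc is your SparkContext):

import sqlContext.implicits._ // encoder for the mapped Dataset

// Load the (small) enrichment table once on the driver and turn it into a map.
val enrichmentMap = sqlContext.read
  .jdbc(jdbcUrl, "other_table", connectionProperties)
  .select("id", "enrichment_data")
  .collect()
  .map(r => r.getString(0) -> r.getString(1))
  .toMap

val enrichmentBc = sc.broadcast(enrichmentMap)

// The closure now only captures the (serializable) broadcast handle.
val enriched = initialDataset.map(r => enrichmentBc.value.getOrElse(r.getString(1), ""))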
If both datasets are big and neither is suitable for loading into node memory, then use mapPartitions (you can find more information about mapPartitions here). mapPartitions is called once per partition instead of once per element, as map() is. If you choose this option you could access the 2nd database from inside mapPartitions, or even load the whole related dataset there (e.g. retrieve all related records from the 2nd database in one query).
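For example, a sketch of the mapPartitions approach that opens one JDBC connection per partition instead of one per row (reusing the variable names from your snippet):

import java.sql.DriverManager
import sqlContext.implicits._ // encoder for the mapped Dataset

val enriched = initialDataset.mapPartitions { rows =>
  // One connection and one prepared statement per partition.
  val conn = DriverManager.getConnection(jdbcUrl, connectionProperties)
  val stmt = conn.prepareStatement("select enrichment_data from other_table where id = ?")

  val result = rows.map { r =>
    stmt.setString(1, r.getString(1))
    val rs = stmt.executeQuery()
    val value = if (rs.next()) rs.getString(1) else ""
    rs.close()
    value
  }.toList // materialize before closing the connection -- the iterator is lazy

  stmt.close()
  conn.close()
  result.iterator
}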
Please be aware that I assumed these two datasets have some kind of dependency (e.g. you need to access some value from the 2nd database before executing the next step). If they don't, then just create both ds1 and ds2 and use them normally, as you would with any dataset. Finally, remember to cache the datasets if you are sure you'll need to access them multiple times.
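For completeness, a minimal sketch of that no-dependency case (the table names, the second URL and the join key "id" are placeholders):

val ds1 = sqlContext.read.jdbc(jdbcUrl, "main_table", connectionProperties)
val ds2 = sqlContext.read.jdbc(otherJdbcUrl, "other_table", otherConnectionProperties)

// Cache only if you will reuse the result more than once.
val joined = ds1.join(ds2, Seq("id"), "left_outer").cache()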
Good luck
I'm looking to understand more about dynamic scenario variables and scenario arrays in my Scenario.
What's the difference between “scenario to load data from” set to a dynamic scenario variable and “compare against scenario” set to a scenario array in Foundry Scenarios?
Scenario to load data from is always a single scenario, whereas the comparison scenarios allow displaying values from multiple scenarios.
Notably, only the table and chart widgets support comparison scenarios, while just about every widget with scenario support has the notion of loading data from a single scenario.
There is some nuance in the tables and charts where the scenario to load data from drives what objects appear in the table and in what order. E.g., if you’ve added an object in a scenario and you are using that scenario as a comparison scenario, you will not see the created object in the table because it doesn’t exist in the scenario to load data from.
If you want to view newly created or deleted objects reflected in what loads in the table, you should configure that scenario as the scenario to load data from.
I have a large dataset (approx. 500GB and 180k data points plus labels) in a PyTorch DataLoader. Until now, I used torch.utils.data.random_split to split the dataset randomly into training and validation. However, this led to serious overfitting. Now, I want to use a deterministic split instead, i.e. based on the paths stored in the dataloader, I could figure out a non-random split. However, I have no idea how to do so... The question is: How can I get the IDs of about 10% of the data points based on some query that looks at the information about the files stored in the data loader (e.g. the paths)?
Have you used a custom dataset along with the DataLoader? If the underlying dataset has some variable that stores the filenames of the individual files, you can access it via dataloader.dataset.filename_variable.
If that's not available, you can create a custom dataset yourself that essentially wraps and calls the original dataset.
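Either way, here is a minimal sketch of a deterministic, path-based split. The attribute name paths and the path pattern are hypothetical; use whatever your dataset actually stores:

from torch.utils.data import Subset

# Deterministic, path-based split. Assumes the dataset exposes its file paths in a
# list-like attribute (here called `paths`, a hypothetical name).
def split_by_path(dataset, is_validation_path):
    """Return (train_subset, val_subset); is_validation_path(path) selects validation files."""
    val_indices = [i for i, p in enumerate(dataset.paths) if is_validation_path(p)]
    val_lookup = set(val_indices)
    train_indices = [i for i in range(len(dataset)) if i not in val_lookup]
    return Subset(dataset, train_indices), Subset(dataset, val_indices)

# Example: hold out all files from one recording session (illustrative pattern only).
# train_ds, val_ds = split_by_path(dataloader.dataset, lambda p: "/session_09/" in p)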
I need to create one JSON file for each row of the dataframe. I'm using partitionBy, which creates a subfolder for each file. Is there a way to avoid creating the subfolders and to rename the JSON files with the unique key?
Or are there any other alternatives? It's a huge dataframe with hundreds of thousands (~300K) of unique values, so repartitioning is eating up a lot of resources and taking time. Thanks.
df.select(Seq(col("UniqueField").as("UniqueField_Copy")) ++ df.columns.map(col): _*)
  .write.partitionBy("UniqueField")
  .mode("overwrite").format("json").save("c:\\temp\\json\\")
Putting all the output in one directory
Your example code is calling partitionBy on a DataFrameWriter object. The documentation tells us that this function:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/
year=2016/month=02/
This is the reason you're getting subdirectories. Simply removing the call to partitionBy will get all your output in one directory.
Getting one row per file
Spark SQL
You had the right idea partitioning your data by UniqueField, since Spark writes one file per partition. Rather than using DataFrameWriter's partitionBy, you can use
df.repartitionByRange(numberOfJson, $"UniqueField")
to get the desired number of partitions, with one JSON file per partition. Notice that this requires you to know the number of JSON files you will end up with in advance. You can compute it with
val numberOfJson = df.select(count($"UniqueField")).first.getAs[Long](0).toInt
However, this adds an additional action to your query, which will cause your entire dataset to be computed again. It sounds like your dataset is too big to fit in memory, so you'll need to carefully consider if caching (or checkpointing) with df.cache (or df.checkpoint) actually saves you computation time. (For large datasets that don't require intensive computation to create, recomputation can actually be faster)
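Putting those pieces together, a rough sketch (the column name and output path are taken from your example; whether the cache pays off is the trade-off described above):

import org.apache.spark.sql.functions.count
import spark.implicits._ // assumes a SparkSession named spark

// Optional: cache so the count and the write don't both recompute df.
val cached = df.cache()

// One partition per row, so Spark writes one JSON file per row.
val numberOfJson = cached.select(count($"UniqueField")).first.getAs[Long](0).toInt

cached
  .repartitionByRange(numberOfJson, $"UniqueField")
  .write
  .mode("overwrite")
  .format("json")
  .save("c:\\temp\\json\\")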
RDD
An alternative to using the Spark SQL API is to drop down to the lower-level RDD. Partitioning by key (in pyspark) for RDDs was discussed thoroughly in the answer to this question. In Scala, you'd have to specify a custom Partitioner as described in this question.
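As a rough sketch (untested against your data), such a Partitioner could give every distinct UniqueField value its own partition; extractUniqueField below is a placeholder you'd have to implement:

import org.apache.spark.Partitioner

// keys is assumed to be the collected array of distinct UniqueField values.
class UniqueFieldPartitioner(keys: Array[String]) extends Partitioner {
  private val index: Map[String, Int] = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.length
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

// Usage sketch:
// val keys = df.select("UniqueField").distinct().collect().map(_.getString(0))
// df.toJSON.rdd
//   .map(json => (extractUniqueField(json), json)) // placeholder key extraction
//   .partitionBy(new UniqueFieldPartitioner(keys))
//   .values
//   .saveAsTextFile("c:\\temp\\json\\")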
Renaming Spark's output files
This is a fairly common question, and AFAIK, the consensus is it's not possible.
Hope this helps, and welcome to Stack Overflow!
I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationship, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this ? Should I merge all the JSON first, or load them in a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guideline to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Additive functions like SUM(), MIN() and MAX() are safe: sums of partial sums are still correct sums. But averages of averages, and count distincts of count distincts, often are not.
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related stack overflow answer
For text files and extracts, Tableau loads them into memory via its Data Engine process today -- to be replaced by a new in-memory database called Hyper in the future. The concept is the same though: Tableau sends the data source a query, which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volume of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer nobody gave is: no, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First, extract the required attributes from the JSON data as columns, as below:
select
  get_json_object(JSON_column, '$.Attribute1') as Attribute1,
  get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for the required tables and join them in Tableau.
I have a problem where I need to have the dimension key for all rows in the data stream.
1. I use a lookup component to find the dimension key of the records.
2. Records with no dimension key (lookup no-match output) are redirected to a different output because they need to be inserted.
3. The no-match output is multicast.
4. The new records are inserted into the dimension.
5. A second lookup component should be executed after the records are inserted.
Step 5 fails because I don't know how to wait for the ADO NET Destination to finish...
Is there any way to solve this other than dump the stream into raw files and use other data flow to resume the task?
I think I understand what you're doing now. Normally you would load the dimension fully first in its own data flow, then after this is fully complete, you load the fact, using the already populated dimension. You appear to be trying to load the fact and dimension in one data flow.
The only reason you'd do this in one data flow is if you couldn't separate your distinct dimensions from your facts and your fact source is so huge you don't want to go through it twice. Normally you can preload a dimension without a large fact source, but this is not always the case. It depends on your source data.
You could use a SEQUENCE (http://technet.microsoft.com/en-us/library/ff878091.aspx) to do this in one data flow. This is a way of autogenerating a number without needing to insert it into a table, but your dimension would need to rely on a sequence instead of an identity. But you'd need to call it in some kind of inline script component or you might be able to trick a lookup component. It would be very slow.
Instead you should try building all of your dimensions in a prior load so that when you arrive at the fact load, all dimensions are already there.
In short the question is: Do you really need to do this in one big step? Can you prebuild your dimension in a prior data flow?
Thomas Kejser wrote a blogpost with a nice solution to this early arriving fact / late arriving dimension problem.
http://blogs.msdn.com/b/sqlcat/archive/2009/05/13/assigning-surrogate-keys-to-early-arriving-facts-using-integration-services.aspx
Basically you use a second lookup with a partial cache. Whenever the partial-cache lookup receives a non-matched row, it calls a SQL statement to fetch data and populate the lookup cache. If you use a stored proc as this SQL statement, you can first add the row to the dimension table and then use a SELECT statement to populate the cache.
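As an illustration of that pattern (the procedure, table and column names below are made up, not from the blog post), the partial-cache lookup's query could be something like EXEC dbo.GetOrAddCustomer ?, with the proc defined as:

-- Hypothetical dimension: dbo.DimCustomer(CustomerKey INT IDENTITY, CustomerBusinessKey NVARCHAR(50), IsInferred BIT)
CREATE PROCEDURE dbo.GetOrAddCustomer
    @BusinessKey NVARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;

    -- Insert an inferred member if this business key is not in the dimension yet.
    IF NOT EXISTS (SELECT 1 FROM dbo.DimCustomer WHERE CustomerBusinessKey = @BusinessKey)
        INSERT INTO dbo.DimCustomer (CustomerBusinessKey, IsInferred)
        VALUES (@BusinessKey, 1);

    -- Return the surrogate key so the lookup can populate its cache.
    SELECT CustomerKey, CustomerBusinessKey
    FROM dbo.DimCustomer
    WHERE CustomerBusinessKey = @BusinessKey;
END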