How to write one JSON file for each row of a DataFrame in Scala/Spark and rename the files

I need to create one JSON file for each row of a DataFrame. I'm using partitionBy, which creates a subfolder for each file. Is there a way to avoid creating the subfolders and to name each JSON file after its unique key?
Or are there any other alternatives? It's a huge DataFrame with about 300K unique values, so repartition is eating up a lot of resources and taking a long time. Thanks.
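import org.apache.spark.sql.functions.col

// Duplicate the key column: partitionBy moves "UniqueField" out of the rows and into the folder name.
df.select(Seq(col("UniqueField").as("UniqueField_Copy")) ++
df.columns.map(col): _*)
.write.partitionBy("UniqueField")
.mode("overwrite").format("json").save("c:\\temp\\json\\")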

Putting all the output in one directory
Your example code is calling partitionBy on a DataFrameWriter object. The documentation tells us that this function:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/
year=2016/month=02/
This is the reason you're getting subdirectories. Simply removing the call to partitionBy will get all your output in one directory.
Getting one row per file
Spark SQL
You had the right idea partitioning your data by UniqueField, since Spark writes one file per partition. Rather than using DataFrameWriter's partitionBy, you can use
df.repartitionByRange(numberOfJson.toInt, $"UniqueField")
to get the desired number of partitions, ideally with one row, and so one JSON file, per partition. Because repartitionByRange estimates the range boundaries by sampling, an exact one-row-per-partition split is not guaranteed, so check the output. Notice that this requires you to know in advance how many JSON files you will end up with. You can compute it by
val numberOfJson = df.select(count($"UniqueField")).first.getAs[Long](0)
However, this adds an additional action to your query, which will cause your entire dataset to be computed again. It sounds like your dataset is too big to fit in memory, so you'll need to carefully consider whether caching (or checkpointing) with df.cache (or df.checkpoint) actually saves you computation time. (For large datasets that don't require intensive computation to create, recomputation can actually be faster.)
RDD
An alternative to using the Spark SQL API is to drop down to the lower-level RDD. Partitioning by key (in PySpark) for RDDs was discussed thoroughly in the answer to this question. In Scala, you'd have to specify a custom Partitioner as described in this question.
Renaming Spark's output files
This is a fairly common question, and AFAIK the consensus is that it's not possible to choose the names Spark gives its output files.
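That said, nothing stops you from renaming the files in a separate step after Spark has finished. Below is a minimal Python sketch of such a step, not something Spark provides. It assumes the partitionBy output sits on a local filesystem, that each UniqueField=value subfolder holds exactly one part-*.json file, and that the key values are safe to use as file names. The function name flatten_partitioned_output is made up for illustration. On HDFS or S3 you would need that filesystem's own API instead.

```python
import shutil
from pathlib import Path


def flatten_partitioned_output(src_dir, dest_dir, column="UniqueField"):
    """Copy each part file out of its `column=value` subfolder to dest_dir/value.json."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    prefix = column + "="
    written = []
    for sub in sorted(src.iterdir()):
        if not (sub.is_dir() and sub.name.startswith(prefix)):
            continue  # skip _SUCCESS and anything else that is not a partition folder
        key = sub.name[len(prefix):]
        parts = sorted(sub.glob("part-*.json"))
        if len(parts) != 1:
            raise ValueError(f"expected exactly one part file in {sub}, found {len(parts)}")
        target = dest / f"{key}.json"
        shutil.copyfile(parts[0], target)
        written.append(target)
    return written
```

With 300K keys this is 300K small file copies, so it is worth timing on a sample before running it on the full output.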
Hope this helps, and welcome to Stack Overflow!

Related

Index multiple CSV files with different headers in Solr

I am trying to index multiple CSV files with different "schemas" in a Solr index. There are possibly some common schema elements (header columns) across these CSVs. My requirement is to be able to provide search across these CSVs, amongst other items.
From what I understand, one way to index would be to treat the entire CSV as a giant text string and index that. I am not sure which aspects of searchability are affected if I index that way.
The other way is basically to define a common schema, then programmatically extract the columns from the doc and index line by line, with the caveat that if a file doesn't have any of the common schema I may not be able to index it. (BTW, this last part may be a non-starter for me, but let's indulge the possibility for now.)
Are there any other ways? Is there any advantage to one over another?
BTW, I tried the schemaless mode but it doesn't work for me. I can index the first file, but the moment I index the next file and it has some different columns, it gives back an error. Is this expected behaviour or am I doing something wrong?
Appreciate any pointers, thanks!
Update: the error with the schemaless mode is "Invalid date format". After doing some research, it seems like this is a different issue than I'd thought: Solr is autodetecting the data as a date and expects it to be in UTC format, which it is not. Is there any way for me to turn off autodetection of dates?

Analyzing multiple Json with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationship, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this? Should I merge all the JSON first, or load them in a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guideline to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Additive functions like SUM(), MIN() and MAX() are safe: sums of partial sums are still correct sums. But averages of averages and count distincts of count distincts often are not.
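To make the point about re-aggregation concrete, here is a small Python illustration with made-up numbers (the day names and values are purely hypothetical). It shows that sums of per-day sums match the overall sum, while the mean of per-day means does not match the overall mean unless you carry the counts along.

```python
from statistics import mean

# Hypothetical per-day measurements, as if pre-aggregated by day in an extract.
days = {"mon": [1.0, 2.0, 3.0], "tue": [10.0]}

all_values = [v for values in days.values() for v in values]

# Additive: the sum of per-day sums equals the overall sum.
sum_of_sums = sum(sum(values) for values in days.values())

# Not additive: the mean of per-day means ignores how many rows each day has.
mean_of_means = mean(mean(values) for values in days.values())
overall_mean = mean(all_values)

# One way around it: keep SUM and COUNT per day and divide at the end.
recombined_mean = sum_of_sums / sum(len(values) for values in days.values())

print(sum_of_sums, mean_of_means, overall_mean, recombined_mean)
```

Here the mean of means comes out at 6.0 while the true mean is 4.0, which is why keeping the sum and the count in the extract is usually the safer choice.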
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query, which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract, which serves as a persistent, potentially filtered and aggregated, cache. See this related Stack Overflow answer.
For text files and extracts, Tableau currently loads them into memory via its Data Engine process, which is to be replaced by a new in-memory database called Hyper. The concept is the same, though: Tableau sends the data source a query, which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high data volumes, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer which nobody gave is that No, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First extract the attributes from the JSON data as columns, as below:
select
get_json_object(JSON_column, '$.Attribute1') as Attribute1,
get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for each of the required tables and join them.

SSIS Flat File - How to handle file versions / generations

I am working on a data warehouse project where many source systems deliver flat files, and we use SSIS to load these into our staging tables. We are currently using the Flat File Source component.
However, after a while we need an extra column in one of the files, and from a given date the file specification changes to add that extra column. This happens quite frequently, and over time quite a lot of versions accumulate.
According to the answers I can find here and on the rest of the internet, the agreed method for handling this scenario seems to be to set up a new flat file source in a new, separate data flow for each version, to keep the ETL process re-runnable for old files.
The method is outlined here, for example: SSIS pkg with flat-file connection with fewer columns will fail
In our specific setup, changes always add columns (old columns are never removed). Also, for logical reasons, the new columns cannot be mandatory if we keep re-runnability for the older files in their separate data flows.
I don't think that creating a duplicate data flow, handling the same set of columns over and over again, is a good answer for a data warehouse project such as ours. I would prefer a source component that takes the latest file version, can mark columns as "not mandatory", and delivers nulls if they are missing.
Is anybody aware of an SSIS Flat File component that is more flexible in handling old file versions, or of a better solution to this problem?
I assume that such a component would need to approach the files on a named column basis rather than the existing left-to-right approach?
Any thoughts or suggestions are welcome!
The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.
You can arrange your flat file connection to return lines rather than individual columns, by only specifying the row delimiter. Connect this to a flat file source component, which will output a single column per row. We now have a single row that represents one of the many file types that you are aware of. The next step is to determine which file type you have.
Consume the output from the flat file source with a script component. Pass in a single column and pass out the superset of all possible columns. We have lost the metadata normally gleaned from a file source, so you will need to build up the column name / type / size within the script component output types.
Within the script component, parse the line and break it into its component columns. You will have to perform a pattern match (maybe using System.Text.RegularExpressions.Regex.Match) to identify when a new column starts. Hopefully the file is well formed, which will aid you. Beware of quotes and commas within text columns.
You can now assess the file type by determining the number of columns you have, and default the missing columns. Set the row's output columns to pass out the constituent parts. You may want to attach a new column to record the file type with your output.
The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.
I would not recommend that you perform the above lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns / types etc, however it will provide you with a single data flow to handle each file version and can be extended as new columns are passed.
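The parsing logic described above can be sketched as follows. This is Python for illustration only; an actual SSIS script component would be written in C# or VB.NET. The column names, the mapping from column count to version, and the function name parse_line are all assumptions, and the sketch relies on the question's premise that new versions only ever append columns.

```python
import csv

# Hypothetical superset of columns across all file versions, oldest columns first.
ALL_COLUMNS = ["id", "name", "amount", "currency", "region"]

# Hypothetical mapping from the number of columns found to the file version.
VERSION_BY_COLUMN_COUNT = {3: "v1", 4: "v2", 5: "v3"}


def parse_line(line):
    """Split one raw line and default the columns its version lacks to None."""
    fields = next(csv.reader([line]))  # copes with quotes and embedded commas
    version = VERSION_BY_COLUMN_COUNT.get(len(fields))
    if version is None:
        raise ValueError(f"unexpected column count {len(fields)}: {line!r}")
    padded = fields + [None] * (len(ALL_COLUMNS) - len(fields))
    row = dict(zip(ALL_COLUMNS, padded))
    row["file_version"] = version  # extra output column recording the file type
    return row


print(parse_line('1,"Smith, J",9.50'))
```

Using a CSV parser rather than a hand-written regular expression avoids most of the trouble with quoted commas, at the cost of assuming the files follow ordinary CSV quoting rules.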

MySQL, load data from file, into number of tables

My basic task is to import parts of data from one single file, into several different tables as fast as possible.
I currently have a file per table, and I manage to import each file into the relevant table by using the LOAD DATA syntax.
Our product received new requirements from a client: he no longer wants to send us multiple files, but instead wants to send a single file which contains all the original records.
I thought of several options:
I may require the client to write a single row before each batch of lines in the file, describing the table into which the batch should be loaded and the number of following lines that need to be imported.
e.g.
Table2,500
...
Table3,400
Then I could try to apply LOAD DATA to each such block of lines, discarding the table name and line count description. Is it feasible?
I may require each record to contain the table name as an additional attribute. Then I need to iterate over the records and insert each one, although I am sure that is much slower than LOAD DATA.
I may also pre-process this file using, for example, Java, and execute LOAD DATA statements in a for loop.
I may require almost any format changes I desire, but it has to be one single file and the import must be fast.
(I have to say that what I call the table description is actually the name of a feature; I have decided that all files relevant to a feature should be saved in a table of their own. This is transparent to the client.)
Which sounds like the best solution? Is there any other suggestion?
It depends on your data file. We're doing something similar and made a small Perl script to read the data file line by line. If the line has the content we need (for example, it starts with table1,) we know that it should be in table 1, so we print that line.
Then you can either save that output to a file or to a named pipe and use that with LOAD DATA.
This will probably perform much better than loading it into temporary tables and from there into new tables.
The Perl script (but you can do it in any language) can be very simple.
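As a rough idea of how simple such a script can be, here is a Python sketch rather than the Perl mentioned above. It assumes each record starts with the table name followed by a comma, as in the table1, example, and that the table names are trusted and safe to use as file names. The function name split_file_by_table and the .csv suffix are made up for illustration.

```python
from pathlib import Path


def split_file_by_table(combined_path, out_dir):
    """Write each record to out_dir/<table>.csv, dropping the table-name prefix."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    handles = {}
    try:
        with open(combined_path, encoding="utf-8") as src:
            for line in src:
                table, sep, record = line.partition(",")
                if not sep:
                    continue  # no table prefix on this line, skip it
                if table not in handles:
                    handles[table] = open(out / f"{table}.csv", "w", encoding="utf-8")
                handles[table].write(record)  # record keeps its trailing newline
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Each resulting file can then be passed to its own LOAD DATA statement, one per table.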
You may have another option, which is to define a single table and load all your data into that table, then use select-insert-delete to transfer data from this table to your target tables. Depending on the total number of columns this may or may not be possible. However, if it is possible, you don't need to write an external Java program and can rely entirely on the database for loading your data, which can also offer you a cleaner and more optimized way of doing the job. You will most probably need an additional marker column, which can be the name of the target table. If so, this can be considered a variant of option 2 above.

Checking for Duplicate Files without Storing their Checksums

For instance, you have an application which processes files that are sent by different clients. The clients send tons of files every day and you load the content of those files into your system. The files have the same format. The only constraint that you are given is that you are not allowed to run the same file twice.
The current way to check whether you have already run a particular file is to create a checksum of the file and store it in another file. So when you get a new file, you can create the checksum of that file and compare it against the checksums of the other files that you have run and stored.
Now, the file that contains the checksums of all the files that you have run so far is getting really, really huge. Searching and comparing is taking too much time.
NOTE: The application uses flat files as its database. Please do not suggest to use rdbms or the like. It is simply not possible at the moment.
Do you think there could be another way to check the duplicate files?
Keep them in different places: have one directory where the client(s) upload files for processing, have another where those files are stored.
Or are you in a situation where the client can upload the same file multiple times? If that's the case, then you pretty much have to do a full comparison each time.
And checksums, while they give you confidence that two files are different (and, depending on the checksum, a very high confidence), are not 100% guaranteed. You simply can't take a practically infinite universe of possible multi-byte streams, reduce them to a 32-byte checksum, and be guaranteed uniqueness.
Also: consider a layered directory structure. For example, a file foobar.txt would be stored using the path /f/fo/foobar.txt. This will minimize the cost of scanning directories (a linear operation) for the specific file.
And if you retain checksums, this can be used for your layering: /1/21/321/myfile.txt (using least-significant digits for the structure; the checksum in this case might be 87654321).
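A minimal Python sketch of the checksum-based layering, following the /1/21/321/myfile.txt example above. The function name layered_path, the root and levels parameters, and the zero padding for very short checksums are assumptions added for illustration.

```python
from pathlib import PurePosixPath


def layered_path(checksum, filename, root="/", levels=3):
    """Build root/<last 1 digit>/<last 2 digits>/.../filename from a checksum."""
    digits = str(checksum).zfill(levels)  # pad so short checksums still give distinct levels
    parts = [digits[-n:] for n in range(1, levels + 1)]
    return PurePosixPath(root, *parts, filename)


print(layered_path(87654321, "myfile.txt"))  # /1/21/321/myfile.txt
```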
Nope. You need to compare all files. Strictly, you need to compare the contents of each new file against all already seen files. You can approximate this with a checksum or hash function, but should you find a new file's hash already listed in your index, you then need to do a full comparison to be sure, since hashes and checksums can have collisions.
So it comes down to how to store the file more efficiently.
I'd recommend you leave it to professional software such as Berkeley DB, memcached, Voldemort or similar.
If you must roll your own you could look at the principles behind binary searching (qsort, bsearch etc).
If you maintain the list of seen checksums (and the path to the full file, for that double-check I mentioned above) in sorted form, you can search for it using a binary search. However, the cost of inserting each new item in the correct order becomes increasingly expensive.
One mitigation for a large number of hashes is to bin-sort your hashes e.g. have 256 bins corresponding to the first byte of the hash. You obviously only have to search and insert in the list of hashes that start with that byte-code, and you omit the first byte from storage.
If you are managing hundreds of millions of hashes (in each bin), then you might consider a two-phase sort such that you have a main list for each hash and then a 'recent' list; once the recent list reaches some threshold, say 100000 items, then you do a merge into the main list (O(n)) and reset the recent list.
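Here is one way the binning and the two-phase 'recent' list could look, as an in-memory Python sketch. The class name BinnedHashIndex and its methods are hypothetical, and the recent entries are kept in a set rather than a list so that lookups there stay cheap. A real flat-file version would also need to persist each bin to disk.

```python
import bisect
import heapq


class BinnedHashIndex:
    """256 bins keyed by the first hash byte; each bin has a sorted main list
    plus a small 'recent' set that is merged into it once it grows."""

    def __init__(self, merge_threshold=100_000):
        self.merge_threshold = merge_threshold
        self.main = [[] for _ in range(256)]
        self.recent = [set() for _ in range(256)]

    def __contains__(self, digest):
        first, rest = digest[0], digest[1:]  # the first byte is implied by the bin
        main = self.main[first]
        i = bisect.bisect_left(main, rest)  # binary search in the sorted main list
        if i < len(main) and main[i] == rest:
            return True
        return rest in self.recent[first]

    def add(self, digest):
        """Return False if the digest is already known, else record it and return True."""
        if digest in self:
            return False
        first, rest = digest[0], digest[1:]
        self.recent[first].add(rest)
        if len(self.recent[first]) >= self.merge_threshold:
            # Linear merge of the sorted main list with the (small) sorted recent set.
            self.main[first] = list(heapq.merge(self.main[first], sorted(self.recent[first])))
            self.recent[first] = set()
        return True
```

Digests are expected as bytes, for example the output of hashlib.sha256(data).digest().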
You need to compare any new document against all previous documents; the efficient way to do that is with hashes.
But you don't have to store all the hashes in a single unordered list, nor does the next step up have to be a full database. Instead you can have directories based on the first digit, or 2 digits of the hash, then files based on the next 2 digits, and those files containing sorted lists of hashes. (Or any similar scheme - you can even make it adaptive, increasing the levels when the files get too big)
That way, searching for matches involves a couple of directory lookups, followed by a binary search in a file.
If you get lots of quick repeats (the same file submitted at the same time), then a look-aside cache might also be worth having.
I think you're going to have to redesign the system, if I understand your situation and requirements correctly.
Just to clarify, I'm working on the basis that clients send you files throughout the day, with filenames that we can assume are irrelevant, and when you receive a file you need to ensure its contents are not the same as another file's contents.
In which case, you do need to compare every file against every other file. That's not really avoidable, and you're doing about the best you can manage at the moment. At the very least, asking for a way to avoid the checksum is asking the wrong question - you have to compare an incoming file against the entire corpus of files already processed today, and comparing the checksums is going to be much faster than comparing entire file bodies (not to mention the memory requirements for the latter...).
However, perhaps you can speed up the checking somewhat. If you store the already-processed checksums in something like a trie, it should be a lot quicker to see if a given file (rather, checksum) has already been processed. For a 32-character hash, you'd need to do a maximum of 32 lookups to see if that file had already been processed, rather than comparing with potentially every other file. As with a binary search, the point is to avoid a linear scan of the existing checksums.
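A trie over checksum characters can be sketched in a few lines of Python using nested dictionaries. The class name ChecksumTrie and the empty-string end marker are choices made for this illustration only.

```python
_END = ""  # marker key; it can never clash with a single checksum character


class ChecksumTrie:
    """Character trie: a lookup costs at most one step per checksum character."""

    def __init__(self):
        self.root = {}

    def add(self, checksum):
        """Insert a checksum; return True if it was new, False if already present."""
        node = self.root
        for ch in checksum:
            node = node.setdefault(ch, {})
        is_new = _END not in node
        node[_END] = True
        return is_new

    def __contains__(self, checksum):
        node = self.root
        for ch in checksum:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node
```

In Python itself a built-in set offers the same membership test with less code, so the trie is mainly useful to show the idea or as a model for an on-disk layout.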
You should at the very least move the checksums file into a proper database file (assuming it isn't already) - although SQLExpress with its 4GB limit might not be enough here. Then, along with each checksum store the filename, file size and date received, add indexes to file size and checksum, and run your query against only the checksums of files with an identical size.
But as Will says, your method of checking for duplicates isn't guaranteed anyway.
Despite you asking us not to suggest an RDBMS, I will still suggest SQLite. If you store all checksums in one table with an index, searches will be quite fast, and integrating SQLite is not a problem at all.
As Will pointed out in his longer answer, you should not store all hashes in a single large file, but simply split them up into several files.
Let's say the alphanumeric-formatted hash is pIqxc9WI. You store that hash in a file named pI_hashes.db (based on the first two characters).
When a new file comes in, calculate the hash, take the first 2 characters, and only do the lookup in the corresponding CHARS_hashes.db file.
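A small Python sketch of that lookup, assuming hex-encoded hashes stored one per line in plain text files. The function names bucket_file and seen_before are hypothetical. The scan within a bucket is linear here; keeping each bucket sorted would allow a binary search instead.

```python
from pathlib import Path


def bucket_file(store_dir, digest_hex, prefix_len=2):
    """Path of the bucket file for a hash, e.g. 'ab12...' -> store_dir/ab_hashes.db."""
    return Path(store_dir) / f"{digest_hex[:prefix_len]}_hashes.db"


def seen_before(store_dir, digest_hex):
    """Return True if the hash is already recorded; otherwise record it and return False."""
    path = bucket_file(store_dir, digest_hex)
    if path.exists():
        with path.open(encoding="ascii") as f:
            if any(line.strip() == digest_hex for line in f):
                return True
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="ascii") as f:
        f.write(digest_hex + "\n")
    return False
```

With two hex characters there are at most 256 bucket files, so each lookup reads roughly 1/256 of the hashes if they are evenly spread.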
After creating a checksum, create a directory with the checksum as the name and then put the file in there. If there are already files in there, compare your new file with the existing ones.
That way, you only have to check one (or a few) files.
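The directory-per-checksum idea, including the full comparison when the directory is not empty, might look like this in Python. SHA-256 is an arbitrary choice of checksum, and the names file_digest and store_if_new are invented for the sketch.

```python
import filecmp
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def store_if_new(incoming, store_dir):
    """Return False if a byte-identical file is already stored; else store it and return True."""
    incoming = Path(incoming)
    bucket = Path(store_dir) / file_digest(incoming)
    bucket.mkdir(parents=True, exist_ok=True)
    existing = sorted(bucket.iterdir())
    for candidate in existing:
        if filecmp.cmp(candidate, incoming, shallow=False):  # full byte comparison
            return False
    # Same checksum but different bytes would be a collision: keep both files.
    shutil.copyfile(incoming, bucket / f"{len(existing)}_{incoming.name}")
    return True
```

In the normal case the directory is either empty (new file) or holds a single file, so at most one full comparison is needed.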
I also suggest adding a header (a single line) to the file which explains what's inside: the date it was created, the IP address of the client, some business keys. The header should be chosen in such a way that you can detect duplicates by reading this single line.
[EDIT] Some file systems bog down when you have a directory with many entries (in this case: the checksum directories). If this is an issue for you, create a second layer by using the first two characters of the checksum as the name of the parent directory. Repeat as necessary.
Don't cut off the two characters from the next level; this way, you can easily find files by checksum if something goes wrong without cutting checksums manually.
As mentioned by others, using a different data structure for storing the checksums is the correct way to go. That said, although you have mentioned that you don't want to go the RDBMS way, why not try SQLite? You can use it like a file, and it is lightning fast. It is also very simple to use, and most languages have SQLite support built in. It will take you less than 40 lines of code in, say, Python.
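For what it's worth, here is roughly what that could look like with Python's standard sqlite3 module. The table name seen, its columns, and the function names are assumptions for this sketch. The primary key on the checksum column gives you the index for free.

```python
import sqlite3


def open_index(db_path):
    """Open (or create) the checksum database; the file behaves like any other flat file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "checksum TEXT PRIMARY KEY, filename TEXT, size INTEGER)"
    )
    return conn


def register(conn, checksum, filename, size):
    """Return True if the checksum was new and is now recorded, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO seen (checksum, filename, size) VALUES (?, ?, ?)",
                (checksum, filename, size),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary key violation: checksum already stored
```

Storing the filename alongside the checksum leaves room for the full byte comparison others have recommended when a checksum does match.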