Foundry code workbooks are too slow, how to iterate faster? - palantir-foundry

I've noticed that code workbooks are too slow when querying from tables. It is much slower than using SQL from a data warehouse. What is the correct workflow to quickly pull and join data for iterating analysis?

As I hinted in the comment, this is very hard to answer because code workbooks were designed for interactivity, so they are normally very fast. That doesn't mean there aren't reasons for them to become slow. I'll list some here; maybe they can help you speed things up:
Doing code workbooks straight from raw can be slow! Check how many files, and what types of files, back a particular dataset. In raw these may be CSV files rather than snappy/parquet, which forces code workbooks to try to infer the schema every time you iterate. Adding a simple raw -> clean transform in a PySpark code repository may help a ton here (a sketch of such a transform follows this list).
Your dataset may be poorly optimized: too many files for the data size will make code workbooks spend a lot of time hitting disk to open each file. You can verify this by going to the dataset details tab -> files and checking the size of your files. It may be worth adding a repartition in your clean step (the same transform as above). This is Spark, not Foundry; read more here: Is it better to have one large parquet file or lots of smaller parquet files?
Your organization may not have enough resources for your compute, or you may have too many people using code workbooks at the same time for whatever quota you have set up. This is something you'll need to check with your platform team or support channels.
Using AQE and Local mode: How do I get better performance in my Palantir Foundry transformation when my data scale is small?
If you are using Python: avoid UDFs, as these can make your code particularly slow, especially if you are comparing against SQL. PySpark UDFs are notoriously slow: Spark functions vs UDF performance?
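For the first two points, here is a minimal sketch of what a raw -> clean transform in a Code Repository might look like. The dataset paths are placeholders and the repartition count is just an example; tune it to your data size.

```python
# Minimal raw -> clean transform sketch (transforms-python in a Code Repository).
# Dataset paths below are placeholders -- substitute your own.
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/clean/my_dataset_clean"),   # hypothetical clean output path
    raw=Input("/Project/raw/my_dataset_raw"),    # hypothetical raw CSV-backed input
)
def clean(raw):
    # The clean output is stored as snappy/parquet, so downstream code workbooks
    # no longer re-read the raw CSVs on every iteration. Repartitioning keeps the
    # file count sensible for the data size (example value only).
    return raw.repartition(8)
```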
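And for the UDF point, a small comparison of a Python UDF against the equivalent built-in Spark function. Column names are made up; in a code workbook the Spark session already exists, so the builder line is only there to make the sketch self-contained.

```python
# Python UDF vs built-in Spark function (illustrative only).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Slow: every row is serialized out to a Python worker and back.
to_upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_upper", to_upper_udf("name"))

# Fast: stays inside the JVM and benefits from Catalyst/Tungsten optimizations.
df.withColumn("name_upper", F.upper("name"))
```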

"What is the correct workflow to quickly pull and join data for
iterating analysis?"
For quick one-off analysis I would recommend using the Foundry JDBC/ODBC driver (installed on your local computer) to query the Foundry SQL Server. Note that this will only work with moderate result-set sizes and low query complexity.
This will allow you to have turnaround times of seconds instead of minutes on your queries.
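As a rough sketch of what that can look like from Python with pyodbc and pandas, assuming you have the Foundry ODBC driver configured as a DSN. The DSN name, dataset paths, and column names below are placeholders, and the quoted-path syntax is an assumption about the SQL dialect.

```python
# Query Foundry SQL Server from a local machine over ODBC (sketch).
import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=foundry_sql_server")  # hypothetical DSN name

# Pull and join two datasets, keeping the result size moderate.
df = pd.read_sql(
    'SELECT a.id, a.value, b.label '
    'FROM "/Project/clean/table_a" a '
    'JOIN "/Project/clean/table_b" b ON a.id = b.id '
    'LIMIT 10000',
    conn,
)
```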

Related

Alternate methods to supply data for machine learning (Other than using CSV files)

I have a question relating to machine learning applications in the real world. It might sound stupid lol.
I've been self-studying machine learning for a while, and most of the exercises used a CSV file as the data source (both processed and raw). I would like to ask: are there any methods other than importing a CSV file to channel/supply data for machine learning?
Example: streaming Facebook/Twitter live feed data for machine learning in real time, rather than collecting old data and storing it in a CSV file.
The data source can be anything. Usually it's provided as a CSV or JSON file. But in the real world, say you have a website such as Twitter, as you're mentioning, you'd be storing your data in a relational DB such as a SQL database, and some of the data you'd be putting in an in-memory cache.
You can basically utilize both of these to retrieve your data and process it. The thing is, when you have too much data to fit in memory, you can't just query everything and process it; in that case, you'll use some smarter approach to process the data in chunks.
A good thing about databases such as SQL databases is that they provide a set of functions you can invoke right in your SQL script to efficiently calculate things. For example, you can get the sum of a column across the whole table using SQL's SUM() function, which allows for efficient and easy data manipulation.
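To make that concrete, here is a small sketch of pulling training data for ML straight from a relational database instead of a CSV. The database file, table, and column names are invented for illustration.

```python
# Sketch: feeding an ML workflow from a relational DB instead of a CSV file.
import sqlite3
import pandas as pd

conn = sqlite3.connect("app.db")  # hypothetical database file

# Let the database do the aggregation (e.g. SUM) before the data reaches Python.
features = pd.read_sql(
    "SELECT user_id, SUM(amount) AS total_spent, COUNT(*) AS n_orders "
    "FROM orders GROUP BY user_id",
    conn,
)

# For tables too large for memory, read and process in chunks instead.
for chunk in pd.read_sql("SELECT * FROM events", conn, chunksize=50_000):
    pass  # fit or update your model incrementally on each chunk
```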

storing json data that will be accessed/altered often in db or in file?

json will be updated up to ~4 times a day
json will be loaded often (every user will use this as base data)
will need to keep the previous version of every saved change (one backup copy)
Given these cases, is there a definite pro/con of storing the JSON data in a file on the server vs storing it in the database? And if storing it in the database, would it make sense for it to have its own table (two rows: one current version, one backup copy)?
Storing, fetching and even querying JSON these days isn't a big deal - especially with NoSQL solutions like MongoDB and Cassandra. In fact, a platform like MongoDB will let you make direct queries into the JSON itself - it stores its data as JSON documents and performs quite well. (I am going to assume you are not talking about massive scale, at least not yet.)
The point being that a system like MongoDB has done a lot of the hard work for you. It will effectively optimize things for you, like loading frequently used documents into memory, optimizing their sizes, and providing mechanisms for traversing large JSON documents without a huge footprint.
If you were to deal with this at the file-by-file level, there are going to be a lot of unforeseen issues you will need to handle down the road: managing file handles, read/write locks on concurrent access, filesystem permissions, disk I/O performance bottlenecks - the list goes on. Even web servers, which serve files day and night and have done some pretty interesting optimizations to manage the performance of file handling, end up working with CDNs (Content Delivery Networks) to optimize performance at the edge and manage scale.
Retaining prior versions of the JSON data can be as simple as not overwriting the existing entry and marking the (n-2) version for deletion. The clean-up can then be done in a separate thread, or by an overnight batch process that removes the extraneous data. (NOTE: this could lead to some fragmentation down the line, but that can be compacted later on.)
So, long story short: I wouldn't store JSON on the filesystem anymore. Put it in something like MongoDB and let it handle the nitty-gritty details. Until you really get to 1B+ transactions, this should do pretty well for you.
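As a rough illustration of the "current version plus one backup" idea in MongoDB, here is a minimal pymongo sketch; the database, collection, and field names are made up, and the connection string assumes a local instance.

```python
# Keep the current JSON payload plus one backup version in MongoDB (sketch).
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
docs = client["mydb"]["config_versions"]            # made-up db/collection names


def save_new_version(payload: dict) -> None:
    """Insert the new version and keep only the two most recent documents."""
    docs.insert_one({"payload": payload, "saved_at": datetime.now(timezone.utc)})
    # Delete everything older than the two most recent versions (current + backup).
    keep = [d["_id"] for d in docs.find().sort("saved_at", DESCENDING).limit(2)]
    docs.delete_many({"_id": {"$nin": keep}})


def load_current() -> dict:
    """Return the payload of the most recently saved version."""
    return docs.find_one(sort=[("saved_at", DESCENDING)])["payload"]
```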

Which is better: Many files or a singular file with a lot of data

I am working with a lot of separate data entries and unfortunately do not know SQL, so I need to know which is the faster method of storing data.
I have several hundred, if not thousands of, individual files storing user data. In this case they are all lists of strings and nothing else, so I have been listing the entries line by line as shown below, accessing the files as needed. Encryption is not necessary.
test
buyhome
foo
etc. (About 75 or so entries)
More recently I have learned how to use JSON and had this question: Would it be faster to leave these as individual files to read as necessary, or as a very large JSON file I can keep in memory?
In-memory access will always be much faster than disk access; however, if your in-memory data is modified and the system crashes, you will lose that data if it has not been saved to some form of persistent storage.
Given the amount of data you say you are working with, you really should be using a database of some sort. Either drop everything and go learn some SQL (the basics are not that hard) or leverage what you know about JSON and look into a NoSQL database like MongoDB.
You will find that using the right tool for the job often saves you more time in the long run than trying to force the tool you currently have to work. Even if you need to invest some time upfront to learn something new.
First thing: DO NOT keep data in memory. Unless you are creating a portal like SO or Reddit, RAM as storage is a bad idea.
Second thing: reading a file is slow. Opening and closing a file is slow too. Try to keep the number of files as low as possible.
If you are going to use each and every one of those files (the key issue is EVERY), keep them together. If you will only need some of them, store them separately.
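If you do go the single-JSON-file route, a minimal sketch of consolidating the per-user text files into one JSON document might look like this; the directory layout and file names are assumptions based on the question.

```python
# Consolidate many per-user string-list files into one JSON file (sketch).
import json
from pathlib import Path

data_dir = Path("user_data")            # hypothetical directory of per-user files
combined = {
    f.stem: f.read_text().splitlines()  # one list of strings per user
    for f in data_dir.glob("*.txt")
}

# Write the consolidated data once...
Path("user_data.json").write_text(json.dumps(combined))

# ...then load it once, keep it in memory, and write it back after changes.
users = json.loads(Path("user_data.json").read_text())
```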

How to approach frequent mass updates to SQL database by non-programmer. Should the DB just import from Excel each time?

I am new to database development so any advice is appreciated.
I am tasked with creating a SQL database. This database will store data that is currently in an Excel file. The data in about 50 rows changes every single day (different rows each day).
The person updating the data is a non-programmer and the updating of info needs to be simple and fast.
I'm thinking my client could just update the Excel file, and then this Excel file would be uploaded to the database. Is that feasible? Is there a better way? My client spends enough time just updating the Excel file, so anything that takes a significant amount of extra time for inputting data is not feasible.
Any feedback or ideas are appreciated!
Context: I haven't made any decisions about which SQL DBMS I will use (or maybe I'll use noSQL or Access). I'm still in the how-should-I-approach-this stage of development.
If your data all fits in an Excel file, there's nothing wrong with that approach. You need to spend your time thinking about how you want to get the data from Excel into the DB, as you have a ton of options as far as programming languages/tools to do that.
I'm personally a huge fan of Node.js (there are npm modules already existing for reading excel files, writing to mysql, etc, so you would end up writing almost nothing yourself) but you could do it using just about anything.
Do yourself a favor and use a simple database (like MySQL) and don't mess with NoSQL for this. The amount of data you have is tiny (if it's coming from an Excel file) and you really don't need to start worrying about the complexities of NoSQL until you have a TON of data or data that is changing extremely rapidly.
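As a rough sketch of that Excel-to-database load: the answer above suggests Node.js, but the same idea in Python is only a few lines. The file name, table name, connection string, and the pymysql driver are assumptions, and the script could be run on a schedule after each Excel update.

```python
# Load the daily-updated Excel file into a MySQL table (sketch).
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; requires the pymysql driver to be installed.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

df = pd.read_excel("daily_data.xlsx")   # the file the client keeps editing
# Replace the table contents with the latest spreadsheet snapshot.
df.to_sql("daily_data", engine, if_exists="replace", index=False)
```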
If the business process currently updates Excel, then that is probably the best approach. Modify the Excel file; a little bit of VBA in Excel can copy it and store it somewhere for the database. Schedule a job in the evening, load the data in, and do whatever processing you like. This is not an unusual architecture.
Given your choice of tools, I would advise you not to use either Access or a NoSQL database, unless you have a particular reason for choosing them. A free database such as Postgres or SQL Server Express should do a good job meeting your needs.

SQL Assemblies vs Application code for complicated queries on large XML columns

I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple webservice which uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, a list of all the distinct values for a certain element, things like that.
I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this ( XPath | XQuery | XSLT ) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
My plan was to embed an assembly in SQL Server (I'm using 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking entirely up the wrong tree here, I guess).
Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.
Any advice here would be appreciated.
Thanks
Edit:
Thanks guys, you've all been a big help. The issue was that we were generating a row in the table per file, each file could have multiple results, and we were doing this each time we ran a particular build job. I wanted to flatten this out into a table view.
Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests generated thousands of results (MSIVAL tests were the worst culprit).
The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result of each test on each file, with the XML holding just the details of that one result - this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool, but the performance is much better. I now have a view which creates a flattened version of one of the classes of results emitted by the build job - it returns >200,000 rows in <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
I now have some issues with the number of times I connect, but I have an idea of how to fix that.
Thanks again! +1's all round
I suggest using the standard XML tools in T-SQL (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use those, I would recommend processing the XML on another machine.
SQLCLR is perfect for smaller functions, but with the restrictions on the usable methods it tends to become an exercise in frustration once you are trying to do more advanced things.
What you're asking about is really a huge balancing act and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap. Secondly, the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML which will destroy your performance. Maybe you're not running the most powerful database hardware and shredding XML is going to be performance prohibitive for the CPU you've got on your database machine.
At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web services machine will almost undoubtedly be easier, as LINQ to XML is a more elegant way of parsing through XML than XQuery shoehorned into T-SQL is. My inclination, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run because you're doing XML parsing on every row, or at least most rows, in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said for taking the easier and more maintainable route of doing all the parsing on the application server.
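For reference, a minimal sketch of the application-side option: pulling the XML column out of SQL Server and shredding it in application code rather than in T-SQL (shown here in Python rather than C#/LINQ to XML). The DSN, table, column, and element names are invented for illustration.

```python
# Shred the XML column on the application side instead of in T-SQL (sketch).
import pyodbc
import xml.etree.ElementTree as ET

conn = pyodbc.connect("DSN=build_results_db")   # hypothetical DSN
cursor = conn.cursor()

statuses = []
# Filter on the relational columns first so only the relevant XML crosses the network.
for (xml_blob,) in cursor.execute(
    "SELECT ResultsXml FROM BuildResults WHERE JobId = ?", 42
):
    root = ET.fromstring(xml_blob)
    # e.g. collect the status attribute of every <Result> element in each document.
    statuses.extend(r.get("status") for r in root.iter("Result"))

distinct_statuses = set(statuses)
```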