SparkSQL: Read JSON or Execute Query on Files Directly?

I have many large JSON files that I'd like to run some analytics against. I'm just getting started with SparkSQL and am trying to make sure I understand the trade-offs between having SparkSQL read the JSON records from file into an RDD/DataFrame (with the schema inferred) and running a SparkSQL query on the files directly. If you have any experience using SparkSQL either way, I'd be interested to hear which method is preferred and why.
Thank you, in advance, for your time and help!

You can call explain() on a Dataset instead of an action like show() or count(). Spark will then print the physical plan it selected.
As far as I know there should be no difference between the two approaches. But I prefer to use the read() method: when I use an IDE, I can see all the available methods. When you do it with SQL, a typo like slect instead of select won't surface until you run your code.
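For illustration, here is a minimal PySpark sketch of both approaches plus a call to explain(); the file path and field name are assumptions, not from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-compare").getOrCreate()

# Approach 1: read the JSON into a DataFrame (schema inferred), then query it.
df = spark.read.json("/data/events/*.json")   # hypothetical path
df.createOrReplaceTempView("events")
via_read = spark.sql("SELECT category, count(*) FROM events GROUP BY category")

# Approach 2: query the files directly with the json.`<path>` syntax.
via_files = spark.sql(
    "SELECT category, count(*) FROM json.`/data/events/*.json` GROUP BY category"
)

# Compare the physical plans Spark selects for each.
via_read.explain()
via_files.explain()
```

Comparing the two printed plans is the quickest way to confirm whether the approaches differ for your data.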

Related

Uploading Data into Redis

I am working on a school project where we need to create a website and use Redis to search a database; in my case it will be a movie database. I have a JSON file with the names and ratings of 100 movies. I would like to load this dataset into Redis instead of entering it manually. The JSON file is saved on my desktop and I am using Ubuntu 20.04.
Is there a way to do it?
I have never used Redis, so my question might be very silly. I've been looking all over the internet and cannot find exactly what needs to be done. I might be googling the wrong question, and maybe that's why I cannot find the answer.
Any help would be appreciated.
Write an appropriate program to do the job. There's no one-size-fits-all process because how your data is structured in Redis is up to you; once you decide on that, it should be easy to write a program that parses the JSON and inserts the data.
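As one possible illustration of that advice, here is a short Python sketch using the redis-py client. The file name, the JSON shape (a list of objects with "title" and "rating"), and the key layout are all assumptions.

```python
import json
import redis

# Assumes a local Redis server and a file like:
# [{"title": "Alien", "rating": 8.4}, ...]
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

with open("movies.json") as f:          # hypothetical file name
    movies = json.load(f)

for i, movie in enumerate(movies):
    # One hash per movie, e.g. movie:0, movie:1, ...
    r.hset(f"movie:{i}", mapping={"title": movie["title"],
                                  "rating": movie["rating"]})
    # A sorted set makes "top rated" style lookups easy later on.
    r.zadd("movies:by_rating", {f"movie:{i}": float(movie["rating"])})
```

How you key and index the movies (hashes, sorted sets, search modules) is exactly the design decision the answer refers to.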

How to get value from hbase and put it into a variable?

This is probably a noob question, so I apologize in advance.
The HBase console, as far as I understand, is an extension of (or a script running on top of) JIRB. It also comes with several HBase-specific commands, one of which is 'get', which retrieves columns/values from a table.
However, it seems like 'get' only writes to the screen and doesn't return values at all.
Is there any native HBase console command that will let me retrieve a value (e.g. a set of rows/columns), put it into a variable, and read it from there?
Thanks
No, there is not a native console command in 0.92. If you dig into the source code, there is a class Hbase::Table that could be used to do what you want. I believe this is going to be more exposed in 0.96. At this point, I have resorted to adding my own Ruby to my shell to handle a variety of common tasks (like using SingleColumnValueFilters on scans).
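If using the shell isn't a hard requirement, an external client can put values straight into variables. Below is a hedged Python sketch using the happybase library, which talks to HBase through the Thrift gateway (the Thrift server must be running); the table, row key, and column names are hypothetical.

```python
import happybase

# Assumes an HBase Thrift server on localhost:9090.
connection = happybase.Connection("localhost", port=9090)
table = connection.table("my_table")        # hypothetical table name

# Fetch a single row into a plain dict: {b"cf:qualifier": b"value", ...}
row = table.row(b"row-key-1")
value = row.get(b"cf:qualifier")            # hypothetical column

# Or scan a range of rows and keep them all in a variable.
rows = dict(table.scan(row_start=b"row-key-1", row_stop=b"row-key-9"))
```

This sidesteps the shell entirely, which may or may not fit your workflow.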

Where to store info besides mysql

My PHP script pulls about 1000 names from the MySQL db on a certain page. These names are used by a JavaScript autocomplete script.
I think there's a better method to do this. I would like to update the names with a cron job once a day (PHP) and store them locally in a text file. Where else could I store them? It's not sensitive info.
It should be readable and writable by PHP.
Since you only need the data updated once a day, have a cron script generate a static JSON file in some fixed location. Then read it with Ajax on the client and make sure the client caches it.
Or, potentially, generate the file whenever the database is updated (if that is applicable; I don't know your application).
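As a rough illustration of that cron job (sketched in Python rather than PHP, purely to show the shape of it; the connection details, table, column, and output path are all assumptions):

```python
import json
import pymysql

# Hypothetical connection details and schema.
conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="appdb")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT name FROM people ORDER BY name")
        names = [row[0] for row in cur.fetchall()]
finally:
    conn.close()

# Write a static file the web server can serve directly to the autocomplete.
with open("/var/www/html/names.json", "w") as f:
    json.dump(names, f)
```

The same structure in PHP (query, build an array, json_encode, write to a file) is only a few lines longer.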
You could try Memcache. But that could be like using a sledge-hammer to crack a nut.
Edit: What about storing the data as a simple file and letting clients (JavaScript) download it? Clients would not have to query the server for every keystroke because they could search for matching values themselves. The format could be JSON because it is simple and native to JavaScript.
It's unlikely that reading from a text file will be much faster than a database query; MySQL already does a lot of caching that should make your query speedy.
If you need to make this query often and performance is a problem, you could consider using a caching module for PHP.
Related: The best way of PHP Caching

CF9 Serializejson gives "out of memory" error

I'm trying to serialize a query to JSON. The query returns about 300,000 records. When serializing, a 500 "out of memory" error occurs.
How can I solve this? Is there a way to stream the query directly to some file format?
300 records shouldn't be enough to overflow the json library...
How much memory does your server have available / assigned to cf?
Can you paste a stack trace?
We use a handy little library called javacsv.
It is marvelous at creating CSVs from arrays of strings. You simply add the .jar file to your classpath, create the Java CSV class, then call a bunch of methods to add columns or rows. It's good because it automagically quotes all of your data, so you don't even have to think about it. It's fast too! I can post some code samples if you are interested.
http://sourceforge.net/projects/javacsv/
CF9 has some spreadsheet exporting methods too, which you should probably check out if you haven't already.
http://cfquickdocs.com/cf9/#cfspreadsheet
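Purely as a language-agnostic illustration of the streaming idea behind the question ("write rows out as you fetch them instead of building one huge string in memory"), here is a Python sketch using the built-in csv module; this is not the CFML/javacsv route described above, and the database, query, and column names are hypothetical.

```python
import csv
import sqlite3  # stand-in database driver; the pattern is the same for any driver

conn = sqlite3.connect("example.db")                 # hypothetical database
cur = conn.cursor()
cur.execute("SELECT id, name, amount FROM orders")   # hypothetical query

with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "amount"])
    # Stream row by row; memory use stays flat no matter how many records.
    for row in cur:
        writer.writerow(row)

conn.close()
```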

Multiple JSON files, parse, and load into tables

I'm a real beginner when it comes to this, so I apologize in advance.
The long and short of what I am looking for is a fairly simple concept: I want to pull JSON data off a server, parse it, and load it into Excel, Access, or some other kind of table. Basically, I want to be able to store the data so I can filter, sort, and query it.
To make matters a little more complicated, the server only returns truncated results with each JSON response, so it will be necessary to make multiple requests to the server.
Are there tools out there or code available which will help me do what I am looking for? I am completely lost, and I have no idea where to start.
(please be gentle)
I'm glad to see this question because I'm doing very similar things! Based on what I've gone through, a lot of it comes down to how the tables are designed (and linked together) in the first place, and then the mapping between those tables and the different JSON objects at different depths or positions in the original JSON file. Once the mapping rules are clear, the code can be written by simply hard-coding the mapping (I mean: if you get a JSON object under a certain parent, you save its data into certain table(s)), provided you're using a high-level JSON parsing library.
OK, as I have to dash home from the office now:
Assuming that you are going to use Excel to parse the data, you are going to need:
1. Some JSON parser, e.g. JSON Parser for VBA
2. Some code to download the JSON
3. A loop of VBA code that goes through each file and parses it into a sheet.
Is this OK for a starter? If you are struggling, let me know and I will try to knock something up a little better over the weekend.
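The steps above are sketched for VBA. Purely as an illustration of the same flow (make repeated requests until the server stops returning data, flatten each record, and write rows Excel or Access can open), here is a hedged Python version; the endpoint, pagination parameters, and field names are all hypothetical.

```python
import csv
import requests

# Hypothetical endpoint that returns at most 100 records per page.
URL = "https://example.com/api/records"

with open("records.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "value"])   # assumed fields

    page = 1
    while True:
        resp = requests.get(URL, params={"page": page, "per_page": 100})
        resp.raise_for_status()
        records = resp.json()
        if not records:                        # empty page -> no more data
            break
        for rec in records:
            writer.writerow([rec["id"], rec["name"], rec["value"]])
        page += 1
```

The resulting CSV can be opened directly in Excel or imported into Access for filtering, sorting, and querying.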