How to read SO's data dump effectively - MySQL

I currently use Vim to read SO's data dump. However, my MacBook slows down when I scroll down just a few rows. This suggests to me that there must be more efficient ways to read the data.
I know little MySQL. The files are in .xml format, which is rather hard to read directly. It may be more efficient to convert the .xml files to MySQL and then query them. The only tool I know for such tasks is MS's db tool, but I would like to know about other tools too.
Problems
to parse the .xml into SQL queries that MySQL understands (we need to know the structure of the data)
to load the data into MySQL
to find a tool similar to the MS db tool with which we can read the data effectively
How do you read SO's data dump effectively?
--
[edit]
How can you run the 523 SQL queries to create the database in your terminal? I have the commands at the moment in a text file.
How can you switch the database to "simple recovery mode"?

I made my first ever Python program to read them and output SQL insert statements for use with MySQL (it's ugly, but it worked). You'll need to create the tables by hand first, though.
import xml.sax.handler
import xml.sax
import sys

class SOHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.errParse = 0

    def startElement(self, name, attributes):
        if name != "row":
            # a non-row element names the table; open output and error files for it
            self.table = name
            self.outFile = open(name + ".sql", "w")
            self.errfile = open(name + ".err", "w")
        else:
            # build one INSERT statement per row element
            skip = 0
            currentRow = u"insert into " + self.table + "("
            for attr in attributes.keys():
                currentRow += str(attr) + ","
            currentRow = currentRow[:-1]
            currentRow += u") values ("
            for attr in attributes.keys():
                try:
                    currentRow += u'"{0}",'.format(attributes[attr].replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'"))
                except UnicodeEncodeError:
                    self.errParse += 1
                    skip = 1
                    self.errfile.write(currentRow)
            if skip != 1:
                currentRow = currentRow[:-1]
                currentRow += u");"
                #print len(attributes.keys())
                self.outFile.write(currentRow.encode("utf-8"))
                self.outFile.write("\n")
                self.outFile.flush()
                print currentRow.encode("utf-8")

    def characters(self, data):
        pass

    def endElement(self, name):
        pass

if len(sys.argv) < 2:
    print "Give me an xml file argument!"
    sys.exit(1)

parser = xml.sax.make_parser()
handler = SOHandler()
parser.setContentHandler(handler)
parser.parse(sys.argv[1])
print handler.errParse
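If it helps, here is a rough sketch of one way to load the generated statements into MySQL from Python. This is only a sketch, not something I've tested against the dump: it assumes the mysql-connector-python package is installed, that the target table already exists, and that the credentials and file name below are placeholders.
import mysql.connector

# connect to a local MySQL server; user, password and database names are placeholders
conn = mysql.connector.connect(user="so_user", password="secret", database="so_dump")
cur = conn.cursor()
with open("posts.sql") as f:
    for line in f:              # the script above writes one INSERT statement per line
        stmt = line.strip()
        if stmt:
            cur.execute(stmt)
conn.commit()
conn.close()
The same file can also be fed straight to the mysql command-line client instead.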


Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source so it is possible to have duplicates for a single ID for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times)
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionalities, though this one isn't here yet. My approach now would be:
def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
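A quick usage sketch (my addition; the imports and the column name "ID" are just examples). Note that this keeps the first row found for each distinct value, and the per-value pc.index calls make it roughly O(n·k) for k unique values:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# table is an existing pyarrow Table; "ID" is whatever column identifies duplicates
deduped = drop_duplicates(table, "ID")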
//end edit
I saw your question because I had a similar one, and I solved it for my work. (Due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before.)
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    # do stuff
I used 'value_counts' to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or last one by using numpy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory-intensive by maintaining a single numpy boolean mask while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask) (sketched below).
I don't know how to calculate the speed though...
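A minimal sketch of that single-mask variant (my own illustration, not my production code): it marks the first occurrence of every value and filters once at the end.
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates_mask(table: pa.Table, column_name: str) -> pa.Table:
    values = table.column(column_name).to_numpy()
    keep = np.full(len(table), False)
    for dct in pc.value_counts(table.column(column_name)).to_pylist():
        # index of the first row holding this value
        first_idx = int(np.argmax(values == dct['values']))
        keep[first_idx] = True
    return table.filter(mask=keep)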
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
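One note from my side: np.unique(..., return_index=True) returns the index of the first occurrence of each value, so the function above keeps the first duplicate. Since the question wants the latest version per ID, here is a variation (my own sketch, same assumptions, and it assumes the table is already sorted so that the row you want is the last one per value) that keeps the last occurrence instead:
import numpy as np
import pyarrow as pa

def drop_duplicates_keep_last(table: pa.Table, col_name: str) -> pa.Table:
    values = np.array(table.column(col_name))
    # first occurrences in the reversed array are last occurrences in the original
    _, rev_indices = np.unique(values[::-1], return_index=True)
    mask = np.full(table.shape[0], False)
    mask[len(values) - 1 - rev_indices] = True
    return table.filter(mask=mask)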
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't call combine_chunks(): there is a bug where Arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# build a global row index as a chunked array that lines up with tb3's existing chunks
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
# keep only the first (minimum-index) row for each ID
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that duckdb can give better performance on group by. Changing the last two lines above into the following gives a 2X speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))

Inconsistent behaviour when attempting to write Dataframe to CSV in Apache Spark

I'm trying to output the optimal hyperparameters for a decision tree classifier I trained using Spark's MLlib to a csv file using Dataframes and spark-csv. Here's a snippet of my code:
// Split the data into training and test sets (10% held out for testing)
val Array(trainingData, testData) = assembledData.randomSplit(Array(0.9, 0.1))
// Define cross validation with a hyperparameter grid
val crossval = new CrossValidator()
  .setEstimator(classifier)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setNumFolds(10)
// Train model
val model = crossval.fit(trainingData)
// Find best hyperparameter combination and create an RDD
val bestModel = model.bestModel
val hyperparamList = new ListBuffer[(String, String)]()
bestModel.extractParamMap().toSeq.foreach(pair => {
  val hyperparam: Tuple2[String, String] = (pair.param.name, pair.value.toString)
  hyperparamList += hyperparam
})
val hyperparameters = sqlContext.sparkContext.parallelize(hyperparamList.toSeq)
// Print the best hyperparameters
println(bestModel.extractParamMap().toSeq.foreach(pair => {
  println(s"${pair.param.parent} ${pair.param.name}")
  println(pair.value)
}))
// Define csv path to output results
var csvPath: String = "/root/results/decision-tree"
val hyperparametersPath: String = csvPath + "/hyperparameters"
val hyperparametersFile: File = new File(hyperparametersPath)
val results = (hyperparameters, hyperparametersPath, hyperparametersFile)
// Convert RDD to Dataframe and write it as csv
val dfToSave = spark.createDataFrame(results._1.map(x => Row(x._1, x._2)))
dfToSave.write.format("csv").mode("overwrite").save(results._2)
// Stop spark session
spark.stop()
After finishing a Spark job, I can see the part-00*... and _SUCCESS files inside the path as expected. However, though there are 13 hyperparameters total in this case (confirmed by printing them on screen), cat-ing the csv files shows not every hyperparameter was written to csv:
user#master:~$ cat /root/results/decision-tree/hyperparameters/part*.csv
checkpointInterval,10
featuresCol,features
maxDepth,5
minInstancesPerNode,1
Also, the hyperparameters that do get written change with every execution. This is executed on an HDFS-based Spark cluster with 1 master and 3 workers that have exactly the same hardware. Could it be a race condition? If so, how can I solve it?
Thanks in advance.
I think I figured it out. I expected dfToSave.write.format("csv").save(path) to write everything to the master node, but since the tasks are distributed to all workers, each worker saves its part of the hyperparameters to a local CSV in its own filesystem. Because in my case the master node is also a worker, I can see its part of the hyperparameters. The "inconsistent behaviour" (i.e. seeing different parts in each execution) is caused by whatever algorithm Spark uses for distributing partitions among workers.
My solution will be to collect the CSVs from all workers using something like scp or rsync to build the full results.

Get sequence from msysrelationships table for importing data from XML file

I have an XML file with all the records from all the tables of my DB.
When I have an empty DB with all relationships defined, I can read the relationships from the msysrelationships table. Now I would like to know how I can find the correct sequence in which to import the data.
If I simply imported the data in the order it is presented, I could import records that reference data which does not yet exist. That is a problem.
I have tried a mathematical approach to find the import sequence, but I was not able to find a correct function to produce it.
Would anyone know how I could build a correct sequence with the info from msysrelationships?
I did find a solution.
First we place all found tables in an array (query by BIBD: How can I get table names from an MS Access Database?).
Then we read all the relationships from msysrelationships.
Now we order the tableNames by dependencies:
Function getDependencies(database)
    Dim loopBit
    Dim changeFound
    loopBit = True
    getTableNames(database)
    Do While loopBit
        changeCount = 0
        DB_Connect database, "SELECT msysrelationships.szObject AS TableName, msysrelationships.szReferencedObject AS Dependency FROM msysrelationships ORDER BY msysrelationships.szObject;"
        Do Until DB_EndRS = True
            Dim tableNameIndex
            Dim dependencyIndex
            Dim tableParking
            For index = 0 To UBound(tableArray)
                If tableArray(index) = DB_Select("TableName") Then
                    tableNameIndex = index
                    Exit For
                End If
            Next
            For index = 0 To UBound(tableArray)
                If tableArray(index) = DB_Select("Dependency") Then
                    dependencyIndex = index
                    Exit For
                End If
            Next
            If tableNameIndex < dependencyIndex Then
                changeCount = changeCount + 1
                tableParking = tableArray(tableNameIndex)
                tableArray(tableNameIndex) = tableArray(dependencyIndex)
                tableArray(dependencyIndex) = tableParking
            End If
            DB_MoveNext
        Loop
        DB_Disconnect
        If changeCount = 0 Then
            loopBit = False
        End If
    Loop
End Function
(I hope this is clear enough, please ask if I should provide extra info.)
This lets me work without knowing the data in the database, as long as I understand the relationships within it.
I hope this helps someone!
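For comparison only (this is my own addition, not part of the VBScript solution above): the ordering we are after is a topological sort of the dependency graph, so if you have the relationships available in code you can also compute it directly. A minimal Python sketch, assuming the (table, referenced table) pairs have already been read from msysrelationships:
from graphlib import TopologicalSorter  # Python 3.9+

def import_order(all_tables, relationships):
    ts = TopologicalSorter()
    for table in all_tables:
        ts.add(table)                      # include tables with no dependencies
    for table, referenced in relationships:
        ts.add(table, referenced)          # the referenced table must be imported first
    return list(ts.static_order())         # parents come before children

# hypothetical example tables and relationships
print(import_order(["Customers", "Orders", "OrderLines"],
                   [("Orders", "Customers"), ("OrderLines", "Orders")]))
# ['Customers', 'Orders', 'OrderLines']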

EOF Error During Dict Slice

I am trying to compile monthly data into an existing JSON file that I loaded via import json. Initially, my json data just had one property, which is 'name':
json_data['features'][1]['properties']
>>{'name':'John'}
But the end result with the monthly data I want is like this:
json_data['features'][1]['properties']
>>{'name':'John',
'2016-01': {'x1':0, 'x2':0, 'x3':1, 'x4':0},
'2016-02': {'x1':1, 'x2':0, 'x3':1, 'x4':0}, ... }
My monthly data are on separate tsv files. They have this format:
John 0 0 1 0
Jane 1 1 1 0
so I loaded them via import csv, parsed through a list of file names, and set about placing them in a collective dictionary like so:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for i in strings:
    with open(i) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
I checked how things turned out by slicing collective_dict like so:
collective_dict['2016-01']['John'][0]
>>'0'
Which is correct; it just needs to be cast into an integer.
For my next feat, I attempted to assign all of the monthly data to the respective json members as part of their external properties:
for i in file_strings:
    for j in range(len(json_data['features'])):
        json_data['features'][j]['properties'][i[:-4]] = {}
        json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
        json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
        json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
        json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
Here I got an arrow pointing at the last few characters:
Syntax Error: unexpected EOF while parsing
It is a pretty complicated slice, so I suppose user error can't be ruled out. However, I did double- and triple-check things. I also looked up this error; it seems to come up with input()-related calls. I'm left a bit confused: I don't see how I made a mistake (although I'm mentally prepared to accept that I did).
My only guess was that something somewhere was not a string. When I checked collective_dict and json_data, everything that was supposed to be a string was a string ('John', 'Jane' et al.). So I guess it's something else.
I made the problem as simple as I could while keeping the original structure of the data and for loops and so forth. I'm using Python 3.6.
Question
Why am I getting the EOF error? How can I build my external properties data without encountering such an error?
Here I have rewritten your last code block to:
for i in file_strings:
    file_name = i[:-4]
    for j in range(len(json_data['features'])):
        name = json_data['features'][j]['properties']['name']
        file_dict = json_data['features'][j]['properties'][file_name] = {}
        for x in range(4):
            x_string = 'x{}'.format(x+1)
            file_dict[x_string] = int(collective_dict[file_name][name][x])
from:
for i in file_strings:
    for j in range(len(json_data['features'])):
        json_data['features'][j]['properties'][i[:-4]] = {}
        json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
        json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
        json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
        json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
That is just to make it a bit more readable, but that shouldn't change anything.
A thing I noticed in your other part of code is the following:
collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
The thing I refer to is the = rows[0]:rows[1:5] for rows in tsv_object part. In my IDE, that does not work, and I'm not sure if that is a typo in your question or if that is actually in your code, but I imagine you want it to actually be
collective_dict[i[:-4]] = {rows[0]:rows[1:5] for rows in tsv_object}
or something like that. I'm not sure if that could confuse the parser into thinking that there is an error at the end of the file.
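For context, here is your loading loop with the braces added (my own illustration of that fix; the file names are the ones from your question):
import csv

file_strings = ['2016-01.tsv', '2016-02.tsv']
collective_dict = {}
for i in file_strings:
    with open(i) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        # dict comprehension: map each name to the four values in that row
        collective_dict[i[:-4]] = {rows[0]: rows[1:5] for rows in tsv_object}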
The ValueError: Invalid literal for int()
If your tsv-data is
John 0 0 1 0
Jane 1 1 1 0
Then it should be no problem to do int() of the string value. E.g.: int('42') will become an int with value 42. However, if you have an error in one, or several, lines of your files, then use something like this block of code to figure out which file and line it is:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for file_name in file_strings:
    print('Reading {}'.format(file_name))
    with open(file_name) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        for line_no, (name, *x_values) in enumerate(tsv_object):
            if len(x_values) != 4:
                print('On line {}, there is only {} values!'.format(line_no, len(x_values)))
            try:
                intx = [int(x) for x in x_values]
            except ValueError as e:
                # Catch "Invalid literal for int()"
                print('Line {}: {}'.format(line_no, e))

Using Groovy in Confluence

I'm new to Groovy and coding in general, but I've come a long way in a very short amount of time. I'm currently working in Confluence to create a tracking tool, which connects to a MySQL database. We've had some great success with this, but have hit a wall with using Groovy and the Run Macro.
Currently, we can use Groovy to populate fields within the Run Macro, which really works well for drop down options, example:
{groovy:output=wiki}
import com.atlassian.renderer.v2.RenderMode
def renderMode = RenderMode.suppress(RenderMode.F_FIRST_PARA)
def getSql = "select * from table where x = y"
def getMacro = "{sql-query:datasource=testdb|table=false} ${getSql} {sql-query}"
def get = subRenderer.render(getMacro, context, renderMode)
def runMacro = """
{run:id=test|autorun=false|replace=name::Name, type::Type:select::${get}|keepRequestParameters = true}
{sql:datasource=testdb|table=false|p1=\$name|p2=\$type}
insert into table1 (name, type) values (?, ?)
{sql}
{run}
"""
out.println runMacro
{groovy}
We've also been able to use Groovy within the Run Macro, example:
{run:id=test|autorun=false|replace=name::Name, type::Type:select::${get}|keepRequestParameters = true}
{groovy}
def checkSql = "{select * from table where name = '\$name' and type = '\$type'}"
def checkMacro = "{sql-query:datasource=testdb|table=false} ${checkSql} {sql-query}"
def check = subRenderer.render(checkMacro, context, renderMode)
if (check == "")
{
println("This information does not exist.")
} else {
println(checkMacro)
}
{groovy}
{run}
However, we can't seem to get both scenarios to work together: Groovy inside of a Run Macro inside of Groovy.
We need to be able to get the variables out of the Run Macro form so that we can perform other functions, like checking the DB for duplicates before inserting data.
My first thought is to bypass the Run Macro and create a simple form in Groovy, but I haven't been too lucky with finding good examples. Can anyone help steer me in the right direction for creating a simple form in Groovy that would replace the Run Macro? Or have suggestions on how to get the rendered variables out of the Run Macro?