Insert large text file in MySQL using Python

I have a big text file which contains comma-separated data. But inside the file, there are multiple data chunks separated by the tags <START> and <END>.
I need to insert these chunks into different tables in MySQL.
I'm reading the file line by line.
When <START> is found, I 'create if not exists' a new table and start inserting line by line.
But this is taking a lot of time, as the file has around 1M rows.
Is there a better way to speed up the inserting process?
Below is the code I'm using:
with open(file_name_with_path) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    start_count = 0
    new_file = False
    file_end = False
    ignore_line = False
    record_list = []
    file_name = ''
    temp_data = []
    for row in csv_reader:
        ignore_line = False
        if row[0].find('<START>') == 0:
            file_name = row[1]
            line_count = 0
            new_file = True
            ignore_line = True
        if row[0].find('<END>') == 0:
            file_end = True
            record_list.append(file_name + '_' + str(line_count))
            line_count = 0
            ignore_line = True
            print('End of a file.')
        if new_file == True:
            print('Start of a new file.')
            new_file = False
        if line_count == 0 and ignore_line is False and new_file is False:
            line_count += 1
            create_table(file_name, row)
            temp_data.append(row)
        elif ignore_line == False:
            line_count += 1
            add_data(file_name, row)
            temp_data.append(row)
    print(f'Processed the file - {",".join(record_list)}')
Sample text file.
<START> File1
col1,col2,col3
1,A,B
2,C,D
<END>
<START> File2
col1,col2,col3,col4,col5
1,A,B,1,2
2,C,D,3,4
<END>

You didn't show us the code you're running. You suggested that, optimistically, you're getting app-level throughput of around 40 rows/sec (50k / 1200), which we all agree is "low". You should easily be able to achieve one or two orders of magnitude better performance.
It's not clear if you're using a local or a remote instance. Possibly you're doing 40 commits per second to local disk, or possibly you're managing 40 WAN round-trip messages per second.
What you want to be using is .executemany() with a batch size of about ten thousand rows inserted per COMMIT. That would be about a hundred COMMITs for your million rows. Numerous examples are available on the web, e.g.
https://pynative.com/python-mysql-insert-data-into-database-table/
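A minimal sketch of that batched approach with mysql-connector-python, assuming the target table already exists; the connection details, file name, table and column names below are placeholders rather than values from the question.
import csv
import mysql.connector

BATCH_SIZE = 10_000  # roughly ten thousand rows per COMMIT, as suggested above

db = mysql.connector.connect(host="localhost", user="user", passwd="secret", database="mydb")
cursor = db.cursor()

insert_sql = "INSERT INTO my_table (col1, col2, col3) VALUES (%s, %s, %s)"
batch = []
with open("big_file.csv") as csv_file:
    for row in csv.reader(csv_file, delimiter=','):
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            cursor.executemany(insert_sql, batch)  # one round trip per batch
            db.commit()                            # one COMMIT per ~10k rows
            batch = []
if batch:
    cursor.executemany(insert_sql, batch)          # flush the remainder
    db.commit()
cursor.close()
db.close()
With the question's <START>/<END> handling, the same idea applies per table: flush and commit the current batch whenever an <END> tag is reached or the batch size is hit.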

Related

If statement not returning the desired result

I'm new to Python, and I believe the issue with my code is caused by the fact that I'm a newbie and there's some theory or concept I must not be familiar with yet.
Yes, this question has been asked before, but mine is different. Believe me, I tried everything I thought needed to be done.
Everything worked until I wrapped everything in the "if five in silos" statement.
After I enter the values for the 6 input functions, the program just finishes with exit code 0. Nothing else happens. The for loop is not initiated.
I want the code to accept either 103 or 106 when prompting for the "five" variable.
I'm using PyCharm and Python 3.7.
import mysql.connector

try:
    db = mysql.connector.connect(
        host="",
        user="",
        passwd="",
        database=""
    )
    one = int(input("Number of requested telephone numbers: "))
    two = input("Enter the prefix (4 characters) with a leading 0: ")[:4]
    three = int(input("Enter the ccid: "))
    four = int(input("Enter the cid: "))
    six = input("Enter case number: ")
    five = int(input("Enter silo (103, 106 only): "))
    cursor = db.cursor()
    cursor.execute(f"SELECT * FROM n1 WHERE ddi LIKE '{two}%' AND silo = 1 AND ccid = 0 LIMIT {one}")
    cursor.fetchall()
    silos = (103, 106)
    if five in silos:
        if cursor.rowcount > 0:
            for row in cursor:
                seven = input(f"{row[1]} has been found on our system. Do you want to continue? Type either Y or N.")
                if seven == "Y":
                    cursor.execute(f"INSERT INTO n{five} (ddi, silo, ccid, campaign, assigned, allocated, "
                                   f"internal_notes, client_notes, agentid, carrier, alias) VALUES "
                                   f"('{row[1]}', 1, {three}, {four}, NOW(), NOW(), 'This is a test.', '', 0, "
                                   f"'{row[13]}', '') "
                                   f"ON DUPLICATE KEY UPDATE "
                                   f"silo = VALUES (silo), "
                                   f"ccid = VALUES (ccid), "
                                   f"campaign = VALUES (campaign);")
                    cursor.execute(f"UPDATE n1 SET silo = {five}, internal_notes = '{six}', allocated = NOW() WHERE "
                                   f"ddi = '{row[1]}'")
                else:
                    print("The operation has been canceled.")
            db.commit()
        else:
            print(f"No results for prefix {two}.")
    else:
        print("Enter either silo 103 or 106.")
    cursor.close()
    db.close()
except (ValueError, NameError):
    print("Please, enter an integer for all questions, except case number.")
Because it must be:
for row in cursor.fetchall():
    # do something
In your code, cursor is the cursor object returned by db.cursor(), but you need to call its fetchall() function to read the rows contained in it.
You're actually calling cursor.fetchall() without doing anything with it; you can assign the call to a variable and then do this:
result = cursor.fetchall()
for row in result:
    # do something
I found the problem: I had to store cursor.fetchall() into a variable.
After I put: eight = cursor.fetchall() before the "silos" tuple, everything worked perfectly.

Automating a process for multiple CSV file

I've been looking around and couldn't find the answer, so here it is.
I'm trying to look into a way of automating the conversion of a CSV file's content into something else for machine learning purposes. I have the content of a single line like this:
0, 0, 0, -2.3145, 5.567...... 65, 65, 125, 70.
(516 columns)
And trying to change it to this:
0,
0,
-2.3145,
5.567
....
65,
65,
125,
70.
(516 rows)
So basically I am transposing the data from horizontal to vertical (a single row to a single column).
It's easily done using Excel, but the problem is that I have 4000+ CSV files, so it takes a lot of time.
On top of that, I have to store the first 512 rows in a CSV in one folder and the last 4 rows in a CSV in another folder, while both files keep the same name.
Eg:
features(folder)
1.CSV
2.CSV
.....
4000+.CSV
labels(folder)
1.CSV
2.CSV
.....
4000+.CSV
Any suggestions on how I can speed things up? I tried writing my own program, but I'm stumped on changing it from row to column. I've only managed to split the single CSV file into its 4000+ pieces.
EDIT:
I've tested putting the CSV rows into an array and then storing the array into a CSV; the code looks like this:
with open('FFTMIM16_512L1H1S0D0_1194.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list[0:512])
print(your_list[512:516])
print(your_list)

with open('test.csv', 'w', newline='') as fa:
    writer = csv.writer(fa)
    writer.writerows(your_list[0:511])

with open('test1.csv', 'w', newline='') as fb:
    writer = csv.writer(fb)
    writer.writerows(your_list[512:516])
It works, but I just need to run it in a loop. A problem that I don't understand is that if I save the values from 0 to 512 in test.csv, it shows 512 rows, but when I store from 513 to 516 in test1.csv, it only shows three rows instead of the four that I need. Changing fb's slice from 512 to 516 works, which doesn't make sense to me, because the value at 512 in test.csv is 0 while in test1.csv it is 69. Why is that? From what I understand of array indexing, it starts from 0 and goes up to the number I need. Or is that not the case in Python?
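For reference, Python slices are half-open: the start index is included and the stop index is excluded, which is what determines these counts. A quick, purely illustrative check:
data = list(range(516))
print(len(data[0:512]))    # 512 items: indices 0 through 511
print(len(data[512:516]))  # 4 items: indices 512 through 515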
EDIT 2:
My new code is as follows:
import csv
import os
import glob
#import itertools

directory = input("INPUT FOLDER: ")
output1 = input("FEATURES FOLDER: ")
output2 = input("LABELS FOLDER: ")

in_files = os.path.join(directory, '*.csv')
for in_file in glob.glob(in_files):
    with open(in_file) as input_file:
        reader = csv.reader(input_file)
        your_list = list(reader)
    filename = os.path.splitext(os.path.basename(in_file))[0] + '.csv'
    with open(os.path.join(output1, filename), 'w', newline='') as output_file1:
        writer = csv.writer(output_file1)
        writer.writerow(your_list[0:512])
    with open(os.path.join(output2, filename), 'w', newline='') as output_file2:
        writer = csv.writer(output_file2)
        writer.writerow(your_list[512:516])
It shows the output as I wanted, but now it also stores apostrophes and brackets, e.g. ['0.0'], ['2.321223']. How do I remove these?
I don't understand why you can't do it programmatically if you already have your 4000+ pieces; just write every piece on a new line.
In my opinion the easiest way, though not automated, would be an editor like Notepad++.
There you can replace "," with "\r\n", or if you want to keep the "," you can replace it with ",\r\n".
If you want it automated, I don't see a non-programmatic way.
By the way, if you use Python with NumPy/SciPy you can just use the .transpose() function; see the small sketch below.
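A tiny sketch of that NumPy idea, assuming each file holds a single row of numbers (the file names are placeholders):
import numpy as np

row = np.loadtxt('1.CSV', delimiter=',')       # 1-D array of shape (516,)
column = row.reshape(1, -1).transpose()        # 2-D array of shape (516, 1)
np.savetxt('1_transposed.CSV', column, delimiter=',')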
*Edit to your comment:
What do you mean by "split from the first to the 512"? If you want parts of size 512, it would be something like:
new_array = []
temp_array = []
k = 0
for num in your_array:
    temp_array.append(num)
    k += 1
    if k % 512 == 0:
        new_array.append(temp_array)
        k = 0
        temp_array = []
# append the last block, which might not be 512-sized
if len(temp_array) > 0:
    new_array.append(temp_array)
# save the arrays
for i in range(len(new_array)):
    saveToCsv(array=new_array[i], name="csv_" + str(i))
Your new_array would now be an array filled with 512-sized arrays.
There might be mistakes here, I did not test the code. To save, you only need a function saveToCsv(array, name) which saves an array into a file.
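Putting the pieces together, a minimal sketch of the whole task from the question, assuming each input CSV holds a single row of 516 values; the folder paths are placeholders. Passing one single-element list per output row to writerows also avoids the bracket/apostrophe issue from EDIT 2:
import csv
import glob
import os

input_dir = 'input'          # placeholder paths
features_dir = 'features'
labels_dir = 'labels'

for in_file in glob.glob(os.path.join(input_dir, '*.csv')):
    with open(in_file, newline='') as f:
        row = next(csv.reader(f))   # the single 516-value row
    filename = os.path.basename(in_file)
    # first 512 values -> one value per line in the features folder
    with open(os.path.join(features_dir, filename), 'w', newline='') as f:
        csv.writer(f).writerows([[value] for value in row[:512]])
    # last 4 values -> one value per line in the labels folder
    with open(os.path.join(labels_dir, filename), 'w', newline='') as f:
        csv.writer(f).writerows([[value] for value in row[512:]])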

Shapefile with overlapping polygons: calculate average values

I have a very big polygon shapefile with hundreds of features, often overlapping each other. Each of these features has a value stored in the attribute table. I simply need to calculate the average values in the areas where they overlap.
I can imagine that this task requires several intricate steps: I was wondering if there is a straightforward methodology.
I'm open to every kind of suggestion; I can use ArcMap, QGIS, arcpy scripts, PostGIS, GDAL… I just need ideas. Thanks!
You should use the Union tool from ArcGIS. It will create new polygons where the polygons overlap. In order to keep the attributes from both polygons, add your polygon shapefile twice as input and use ALL as the join_attributes parameter. This also creates polygons intersecting with themselves; you can select and delete them easily, as they have the same FIDs. Then just add a new field to the attribute table and calculate it based on the two original value fields from the input polygons.
This can be done in a script or directly with the toolbox's tools.
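A rough arcpy sketch of that Union-based workflow; the paths and field names below are placeholders, and the copied field names that Union produces (e.g. VALUE_1) depend on your data, so check them in the output attribute table:
import arcpy

in_fc = r"C:\data\polygons.shp"            # hypothetical input shapefile
union_fc = r"C:\data\polygons_union.shp"   # hypothetical output

# Union the shapefile with itself, keeping all attributes from both copies
arcpy.Union_analysis([in_fc, in_fc], union_fc, "ALL")

# Add a field and calculate the average of the two original value fields
arcpy.AddField_management(union_fc, "AVG_VAL", "DOUBLE")
arcpy.CalculateField_management(union_fc, "AVG_VAL",
                                "(!VALUE! + !VALUE_1!) / 2.0", "PYTHON")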
After a few attempts, I found a solution: rasterising all the features individually and then performing cell statistics in order to calculate the average.
See below the script I wrote; please do not hesitate to comment and improve it!
Thanks!
#This script processes a shapefile of snow persistence (area of interest: Afghanistan).
#The input shapefile represents a month of snow cover and contains several features.
#Each feature represents a particular day and a particular snow persistence (low, medium, high, nodata).
#These features are multipart polygons, often overlapping.
#A feature of a particular day can overlap a feature of another one, but features of the same day and with
#different snow persistence can not overlap each other.
#(Potentially, each shapefile contains 31*4 features.)
#The script takes the features singularly and exports each feature in a temporary shapefile
#which contains only one feature.
#Then, each feature is converted to raster, and
#a logical conditional expression gives a value to the pixel according to the intensity (high=3, medium=2, low=1, nodata=skipped).
#Finally, all these rasters are summed and divided by the number of days, in order to
#calculate an average value.
#The result is a raster with the average snow persistence in a particular month.
#This output raster ranges from 0 (no snow) to 3 (persistent snow for the whole month)
#and values outside this range should be considered as small errors in pixel overlapping.
#This script needs a particular folder structure. The folder C:\TEMP\Afgh_snow_cover contains 3 subfolders:
#input, temp and outputs. The script takes care automatically of the cleaning of temporary data.
import arcpy, numpy, os
from arcpy.sa import *
from arcpy import env

#function for finding unique values of a field in a FC
def unique_values_in_table(table, field):
    data = arcpy.da.TableToNumPyArray(table, [field])
    return numpy.unique(data[field])

#check extensions
try:
    if arcpy.CheckExtension("Spatial") == "Available":
        arcpy.CheckOutExtension("Spatial")
    else:
        # Raise a custom exception
        raise LicenseError
except LicenseError:
    print "Spatial Analyst license is unavailable"
except:
    print arcpy.GetMessages(2)
finally:
    # Check in the Spatial Analyst extension
    arcpy.CheckInExtension("Spatial")

# parameters and environment
temp_folder = r"C:\TEMP\Afgh_snow_cover\temp_rasters"
output_folder = r"C:\TEMP\Afgh_snow_cover\output_rasters"
env.workspace = temp_folder
unique_field = "FID"
field_Date = "DATE"
field_Type = "Type"
cellSize = 0.02
fc = r"C:\TEMP\Afgh_snow_cover\input_shapefiles\snow_cover_Dec2007.shp"
stat_output_name = fc[-11:-4] + ".tif"
#print stat_output_name
arcpy.env.extent = "MAXOF"

#find all the unique IDs of the FC
uniqueIDs = unique_values_in_table(fc, "FID")
#make layer for selecting
arcpy.MakeFeatureLayer_management(fc, "lyr")
#uniqueIDs = uniqueIDs[-5:]
totFeatures = len(uniqueIDs)

#for each feature, get the date and the type of snow persistence (type can be high, medium, low and nodata)
for i in uniqueIDs:
    SC = arcpy.SearchCursor(fc)
    for row in SC:
        if row.getValue(unique_field) == i:
            datestring = row.getValue(field_Date)
            typestring = row.getValue(field_Type)
            month = str(datestring.month)
            day = str(datestring.day)
            year = str(datestring.year)
            #format month and day strings
            if len(month) == 1:
                month = '0' + month
            if len(day) == 1:
                day = '0' + day
            #convert snow persistence to numerical value
            if typestring == 'high':
                typestring2 = 3
            if typestring == 'medium':
                typestring2 = 2
            if typestring == 'low':
                typestring2 = 1
            if typestring == 'nodata':
                typestring2 = 0
            #skip the NoData features, and repeat the following for each feature (a feature is a day and a persistence value)
            if typestring2 > 0:
                #create expression for selecting the feature
                expression = ' "FID" = ' + str(i) + ' '
                #select the feature
                arcpy.SelectLayerByAttribute_management("lyr", "NEW_SELECTION", expression)
                #outFeatureClass = os.path.join(temp_folder, ("M_Y_" + str(i)))
                #create feature class name, writing the snow persistence value at the end of the name
                outFeatureClass = "Afg_" + str(year) + str(month) + str(day) + "_" + str(typestring2) + '.shp'
                #export the feature
                arcpy.FeatureClassToFeatureClass_conversion("lyr", temp_folder, outFeatureClass)
                print "exported FID " + str(i) + " \ " + str(totFeatures)
                #create name of the raster and convert the newly created feature to raster
                outRaster = outFeatureClass[4:-4] + ".tif"
                arcpy.FeatureToRaster_conversion(outFeatureClass, field_Type, outRaster, cellSize)
                #remove the temporary fc
                arcpy.Delete_management(outFeatureClass)
    del SC, row

#now many rasters are created, representing the snow persistence types of each day.
#list all the rasters created
rasterList = arcpy.ListRasters("*", "All")
print rasterList
#now the rasters have values 1 and 0. the following loop will
#perform CON expressions in order to assign the value of snow persistence
for i in rasterList:
    print i + ":"
    inRaster = Raster(i)
    #set the value of snow persistence, stored in the raster name
    value_to_set = i[-5]
    inTrueRaster = int(value_to_set)
    inFalseConstant = 0
    whereClause = "Value > 0"
    # Check out the ArcGIS Spatial Analyst extension license
    arcpy.CheckOutExtension("Spatial")
    print 'Executing CON expression and deleting input'
    # Execute Con, in order to assign to each pixel the value of snow persistence
    print str(inTrueRaster)
    try:
        outCon = Con(inRaster, inTrueRaster, inFalseConstant, whereClause)
    except:
        print 'CON expression failed (probably empty raster!)'
    nameoutput = i[:-4] + "_c.tif"
    outCon.save(nameoutput)
    #delete the temp rasters with values 0 and 1
    arcpy.Delete_management(i)

#list the rasters with values of snow persistence
rasterList = arcpy.ListRasters("*_c.tif", "All")
#sum the rasters
print "Calculating SUM"
outCellStats = CellStatistics(rasterList, "SUM", "DATA")
#calculate the number of days (num of rasters / 3)
print "Calculating day ratio"
num_of_rasters = len(rasterList)
print 'Num of rasters : ' + str(num_of_rasters)
num_of_days = num_of_rasters / 3
print 'Num of days : ' + str(num_of_days)
#in order to store decimal values, multiply the raster by 1000 before dividing
outCellStats = outCellStats * 1000 / num_of_days
#save the output raster
print "saving output " + stat_output_name
stat_output_name = os.path.join(output_folder, stat_output_name)
outCellStats.save(stat_output_name)
#delete the remaining temporary rasters
print "deleting CON rasters"
for i in rasterList:
    print "deleting " + i
    arcpy.Delete_management(i)
arcpy.Delete_management("lyr")
Could you rasterize your polygons into multiple layers, where each pixel contains your attribute value, and then merge the layers by averaging the attribute values?

Filtering rows in a csv file with Python

I have a script that opens and modifies a text file. The text file contains personnel info and a lunch account balance. My script takes the text file, removes the quotes, and only writes rows that contain the values D, F or R in column 8. It writes this filtered data to two files: a csv import file called lunchimport.csv for a separate program, and a csv temp file used for further filtering. The second stage of the script uses the csv temp file to generate two additional csv files. One file, negativebal.csv, contains only rows with a negative value in column 14. The other file, lowbal.csv, contains rows with a value between 0 and 5 in column 14.
My issue is that I can't get the script to filter "between" values properly. When using the code below to just write rows with values in column 14 between 0 and 5, nothing is filtered out. If I use values between 0 and 1.99 it works. Anything greater than 1.99 and the code doesn't filter anything:
if row[13] > "0" and row[13] < "1.99":
lowwriter.writerow([row[0], row[13]])
I have pasted my entire code below. I do use a lot of temp files to accomplish my tasks. There is probably a better way, but I'm just interested in getting my filters to work properly.
import os
import csv

infile = open("\\\\comalexsrv\\export\\update.txt", "r")
outfile1 = open("casttemp1.csv", "w")
infile2 = open("casttemp1.csv", "r")
outfile2 = open("casttemp2.csv", "w")
infile3 = open("casttemp2.csv", "r")
outfile3 = open("casttemp3.csv", "w")
infile4 = open("casttemp3.csv", "r")
inowcsv = open("F:\zbennett\Lunch_Imports\lunchimport.csv", "w")
negcastcsv = open("\\\\tcdc\\inow_transfer$\\negativebal.csv", "w")
lowcastcsv = open("\\\\tcdc\\inow_transfer$\\lowbal.csv", "w")

# Remove quotes in update.txt, write to outfile1 (casttemp1.csv)
string = infile.read()
outfile1.write(string.replace("\"", ''))

# Open infile2 (casttemp1.csv), write rows with D,F,R in column 8 to outfile2 (casttemp2.csv)
# Open infile2 (casttemp1.csv), write rows with D,F,R in column 8 to inowcsv (F:\zbennett\Lunch_Imports\lunchimport.csv)
# Open infile2 (casttemp1.csv), write rows with D,R in column 8 to outfile3 (casttemp3.csv)
tempwriter = csv.writer(outfile2, delimiter=',', lineterminator='\n')
importwriter = csv.writer(inowcsv, delimiter=',', lineterminator='\n')
lowtemp = csv.writer(outfile3, delimiter=',', lineterminator='\n')
for row in csv.reader(infile2, delimiter=','):
    if row[7] == "D":
        tempwriter.writerow(row)
        importwriter.writerow(row)
        lowtemp.writerow(row)
    if row[7] == "F":
        tempwriter.writerow(row)
        importwriter.writerow(row)
    if row[7] == "R":
        tempwriter.writerow(row)
        importwriter.writerow(row)
        lowtemp.writerow(row)

# Open infile3 (casttemp2.csv), write columns 1,14 for rows with less than 0 in column 14 to negcastcsv (\\tcdc\inow_transfer$\negativebal.csv)
negwriter = csv.writer(negcastcsv, delimiter=',', lineterminator='\n')
for row in csv.reader(infile3, delimiter=','):
    if row[13] < "0":
        negwriter.writerow([row[0], row[13]])

# Open infile4 (casttemp3.csv), write columns 1,14 for rows with column 14 greater than 0 and less than 1.75 to lowcastcsv (\\tcdc\inow_transfer$\lowbal.csv)
lowwriter = csv.writer(lowcastcsv, delimiter=',', lineterminator='\n')
for row in csv.reader(infile4, delimiter=','):
    if row[13] > "0" and row[13] < "1.99":
        lowwriter.writerow([row[0], row[13]])

infile.close()
outfile1.close()
infile2.close()
outfile2.close()
inowcsv.close()
outfile3.close()
infile3.close()
infile4.close()
negcastcsv.close()
lowcastcsv.close()

# Delete the temporary csv files
os.remove("casttemp1.csv")
os.remove("casttemp2.csv")
os.remove("casttemp3.csv")
Comparison is happening using strings, when you probably want numeric comparison:
if 0. < float(row[13]) < 1.99:
    lowwriter.writerow([row[0], row[13]])
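Applied to the two filtering loops from the question (reusing the same readers and writers), a sketch of that numeric comparison might look like this; the 0-to-5 range is the one described in the question text, and if the file has a header row or blank fields, float() will raise a ValueError, so skip or guard those rows first:
for row in csv.reader(infile3, delimiter=','):
    balance = float(row[13])
    if balance < 0:
        negwriter.writerow([row[0], row[13]])

for row in csv.reader(infile4, delimiter=','):
    balance = float(row[13])
    if 0 < balance < 5:
        lowwriter.writerow([row[0], row[13]])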

HBase shell scan bytes to string conversion

I would like to scan an HBase table and see integers as strings (not their binary representation). I can do the conversion, but I have no idea how to write the scan statement using the Java API from the HBase shell:
org.apache.hadoop.hbase.util.Bytes.toString(
"\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65".to_java_bytes)
org.apache.hadoop.hbase.util.Bytes.toString("Hello HBase".to_java_bytes)
I would be very happy to have examples of scan and get that search binary data (longs) and output normal strings. I am using the HBase shell, not Java.
HBase stores data as byte arrays (untyped). Therefore, if you perform a table scan, data will be displayed in a common format (an escaped hexadecimal string), e.g.: "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65" -> Hello HBase
If you want to get back the typed value from the serialized byte array, you have to do this manually.
You have the following options:
Java code (Bytes.toString(...))
hack the to_string function in $HBASE_HOME/lib/ruby/hbase/table.rb: replace toStringBinary with toInt for non-meta tables
write a get/scan JRuby function which converts the byte array to the appropriate type
Since you want it in the HBase shell, consider the last option:
Create a file get_result.rb :
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Result;
import java.util.ArrayList;

# Simple function equivalent to scan 'test', {COLUMNS => 'c:c2'}
def get_result()
  htable = HTable.new(HBaseConfiguration.new, "test")
  rs = htable.getScanner(Bytes.toBytes("c"), Bytes.toBytes("c2"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN\+CELL"
  rs.each { |r|
    r.raw.each { |kv|
      row = Bytes.toString(kv.getRow)
      fam = Bytes.toString(kv.getFamily)
      ql = Bytes.toString(kv.getQualifier)
      ts = kv.getTimestamp
      val = Bytes.toInt(kv.getValue)
      output.add " #{row} \t\t\t\t\t\t column=#{fam}:#{ql}, timestamp=#{ts}, value=#{val}"
    }
  }
  output.each { |line| puts "#{line}\n" }
end
load it in the HBase shell and use it:
require '/path/to/get_result'
get_result
Note: modify/enhance/fix the code according to your needs
Just for completeness' sake, it turns out that the call Bytes::toStringBinary gives the hex-escaped sequence you get in HBase shell:
\x0B\x2_SOME_ASCII_TEXT_\x10\x00...
Whereas, Bytes::toString will try to deserialize to a string assuming UTF8, which will look more like:
\u8900\u0710\u0115\u0320\u0000_SOME_UTF8_TEXT_\u4009...
You can add a scan_counter command to the HBase shell.
First, add the following to /usr/lib/hbase/lib/ruby/hbase/table.rb (after the scan function):
#----------------------------------------------------------------------------------------------
# Scans whole table or a range of keys and returns rows matching specific criteria with values as numbers
def scan_counter(args = {})
  unless args.kind_of?(Hash)
    raise ArgumentError, "Arguments should be a hash. Failed to parse #{args.inspect}, #{args.class}"
  end
  limit = args.delete("LIMIT") || -1
  maxlength = args.delete("MAXLENGTH") || -1

  if args.any?
    filter = args["FILTER"]
    startrow = args["STARTROW"] || ''
    stoprow = args["STOPROW"]
    timestamp = args["TIMESTAMP"]
    columns = args["COLUMNS"] || args["COLUMN"] || get_all_columns
    cache = args["CACHE_BLOCKS"] || true
    versions = args["VERSIONS"] || 1
    timerange = args[TIMERANGE]

    # Normalize column names
    columns = [columns] if columns.class == String
    unless columns.kind_of?(Array)
      raise ArgumentError.new("COLUMNS must be specified as a String or an Array")
    end

    scan = if stoprow
      org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes, stoprow.to_java_bytes)
    else
      org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes)
    end

    columns.each { |c| scan.addColumns(c) }
    scan.setFilter(filter) if filter
    scan.setTimeStamp(timestamp) if timestamp
    scan.setCacheBlocks(cache)
    scan.setMaxVersions(versions) if versions > 1
    scan.setTimeRange(timerange[0], timerange[1]) if timerange
  else
    scan = org.apache.hadoop.hbase.client.Scan.new
  end

  # Start the scanner
  scanner = @table.getScanner(scan)
  count = 0
  res = {}
  iter = scanner.iterator

  # Iterate results
  while iter.hasNext
    if limit > 0 && count >= limit
      break
    end
    row = iter.next
    key = org.apache.hadoop.hbase.util.Bytes::toStringBinary(row.getRow)
    row.list.each do |kv|
      family = String.from_java_bytes(kv.getFamily)
      qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getQualifier)
      column = "#{family}:#{qualifier}"
      cell = to_string_scan_counter(column, kv, maxlength)
      if block_given?
        yield(key, "column=#{column}, #{cell}")
      else
        res[key] ||= {}
        res[key][column] = cell
      end
    end
    # One more row processed
    count += 1
  end

  return ((block_given?) ? count : res)
end

#----------------------------------------------------------------------------------------
# Helper methods

# Returns a list of column names in the table
def get_all_columns
  @table.table_descriptor.getFamilies.map do |family|
    "#{family.getNameAsString}:"
  end
end

# Checks if current table is one of the 'meta' tables
def is_meta_table?
  tn = @table.table_name
  org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::META_TABLE_NAME) || org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::ROOT_TABLE_NAME)
end

# Returns family and (when it has one) qualifier for a column name
def parse_column_name(column)
  split = org.apache.hadoop.hbase.KeyValue.parseColumn(column.to_java_bytes)
  return split[0], (split.length > 1) ? split[1] : nil
end

# Make a String of the passed kv
# Intercept cells whose format we know, such as the info:regioninfo in .META.
def to_string(column, kv, maxlength = -1)
  if is_meta_table?
    if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
      hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
      return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
    end
    if column == 'info:serverstartcode'
      if kv.getValue.length > 0
        str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
      else
        str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
      end
      return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
    end
  end

  val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
  (maxlength != -1) ? val[0, maxlength] : val
end

def to_string_scan_counter(column, kv, maxlength = -1)
  if is_meta_table?
    if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
      hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
      return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
    end
    if column == 'info:serverstartcode'
      if kv.getValue.length > 0
        str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
      else
        str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
      end
      return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
    end
  end

  val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)}"
  (maxlength != -1) ? val[0, maxlength] : val
end
Second, add to /usr/lib/hbase/lib/ruby/shell/commands/ the following file, called scan_counter.rb:
#
# Copyright 2010 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
module Shell
  module Commands
    class ScanCounter < Command
      def help
        return <<-EOF
Scan a table whose cell values are longs; pass table name and optionally a dictionary of scanner
specifications. Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.
Some examples:
  hbase> scan_counter '.META.'
  hbase> scan_counter '.META.', {COLUMNS => 'info:regioninfo'}
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan_counter 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  hbase> scan_counter 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false). By
default it is enabled. Examples:
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
EOF
      end

      def command(table, args = {})
        now = Time.now
        formatter.header(["ROW", "COLUMN+CELL"])
        count = table(table).scan_counter(args) do |row, cells|
          formatter.row([row, cells])
        end
        formatter.footer(now, count)
      end
    end
  end
end
Finally, add the scan_counter command to /usr/lib/hbase/lib/ruby/shell.rb.
Replace the current command group with this (you can identify it by 'DATA MANIPULATION COMMANDS'):
Shell.load_command_group(
  'dml',
  :full_name => 'DATA MANIPULATION COMMANDS',
  :commands => %w[
    count
    delete
    deleteall
    get
    get_counter
    incr
    put
    scan
    scan_counter
    truncate
  ]
)