I am parsing the links on Wikipedia pages of actors, trying to find links to films they appeared in.
I have a basic method that searches the links and checks for the word "film" in the link. However, many of the links to films do not actually contain this word.
Within the paragraphs that contain the links, though, the word "film" does appear, for example:
<p>Dreyfuss's first film part was a small, uncredited role in
<i><a href="/wiki/The_Graduate" title="The Graduate">The Graduate
// Paragraph goes on for a long time.
Here is the block from the method that checks all the links:
all_links = doca.search('//a[@href]')
all_links.each do |link|
link_info = link['href']
if link_info.include?("(film)") && !(link_info.include?("Category:") || link_info.include?("php"))
then out << link_info end
end
out.uniq.collect {|link| strip_out_name(link)}
Would there be a way of checking the text that comes before the link but after the <p> tag for the word "film", while being careful not to check other links (and perhaps also limiting the search to the 50 characters before the link)?
Thanks for any help or suggestions.
The main page that I am testing on is the Wikipedia article for Richard Dreyfuss.
It is possible to search for text inside a tag. See https://stackoverflow.com/a/19816840/128421 for an example.
But I'd do it something like this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Richard_Dreyfuss'))
table = doc.at('#Filmography').parent.next_element
films = table.search('tr')[1..-1].map{ |tr|
tds = tr.search('td')
year = tds.shift.text
movie = tds.shift
movie_url = movie.at('a')['href']
movie_title = movie.at('a').text
role = tds.shift.text
{
year: year,
movie_url: movie_url,
movie_title: movie_title,
role: role
}
}
films
# => [{:year=>"1966",
# :movie_url=>"/wiki/Bewitched",
# :movie_title=>"Bewitched",
# :role=>"Rodney"},
# {:year=>"1966",
# :movie_url=>"/wiki/Gidget_(TV_series)",
# :movie_title=>"Gidget",
# :role=>"Durf the Drag"},
# {:year=>"1967",
# :movie_url=>"/wiki/Valley_of_the_Dolls_(film)",
# :movie_title=>"Valley of the Dolls",
# :role=>"Assistant stage manager"},
# {:year=>"1967",
# :movie_url=>"/wiki/The_Graduate",
# :movie_title=>"The Graduate",
# :role=>"Boarding House Resident"},
# {:year=>"1967",
# :movie_url=>"/wiki/The_Big_Valley",
# :movie_title=>"The Big Valley",
# :role=>"Lud Akley"},
# {:year=>"1968",
# :movie_url=>"/wiki/The_Young_Runaways",
# :movie_title=>"The Young Runaways",
# :role=>"Terry"},
# {:year=>"1969",
# :movie_url=>"/wiki/Hello_Down_There",
# :movie_title=>"Hello Down There",
# :role=>"Harold Webster"},
# {:year=>"1970",
# :movie_url=>"/wiki/The_Mod_Squad",
# :movie_title=>"The Mod Squad",
# :role=>"Curtis Bell"},
# {:year=>"1973",
# :movie_url=>"/wiki/American_Graffiti",
# :movie_title=>"American Graffiti",
# :role=>"Curt Henderson"},
# {:year=>"1973",
# :movie_url=>"/wiki/Dillinger_(1973_film)",
# :movie_title=>"Dillinger",
# :role=>"Baby Face Nelson"},
# {:year=>"1974",
# :movie_url=>"/wiki/The_Apprenticeship_of_Duddy_Kravitz_(film)",
# :movie_title=>"The Apprenticeship of Duddy Kravitz",
# :role=>"Duddy"},
# {:year=>"1974",
# :movie_url=>"/wiki/The_Second_Coming_of_Suzanne",
# :movie_title=>"The Second Coming of Suzanne",
# :role=>"Clavius"},
# {:year=>"1975",
# :movie_url=>"/wiki/Inserts_(film)",
# :movie_title=>"Inserts",
# :role=>"The Boy Wonder"},
# {:year=>"1975",
# :movie_url=>"/wiki/Jaws_(film)",
# :movie_title=>"Jaws",
# :role=>"Matt Hooper"},
# {:year=>"1976",
# :movie_url=>"/wiki/Victory_at_Entebbe",
# :movie_title=>"Victory at Entebbe",
# :role=>"Colonel Yonatan 'Yonni' Netanyahu"},
# {:year=>"1977",
# :movie_url=>"/wiki/Close_Encounters_of_the_Third_Kind",
# :movie_title=>"Close Encounters of the Third Kind",
# :role=>"Roy Neary"},
# {:year=>"1977",
# :movie_url=>"/wiki/The_Goodbye_Girl",
# :movie_title=>"The Goodbye Girl",
# :role=>"Elliott Garfield"},
# {:year=>"1978",
# :movie_url=>"/wiki/The_Big_Fix",
# :movie_title=>"The Big Fix",
# :role=>"Moses Wine"},
# {:year=>"1980",
# :movie_url=>"/wiki/The_Competition_(film)",
# :movie_title=>"The Competition",
# :role=>"Paul Dietrich"},
# {:year=>"1981",
# :movie_url=>"/wiki/Whose_Life_Is_It_Anyway%3F_(1981_film)",
# :movie_title=>"Whose Life Is It Anyway?",
# :role=>"Ken Harrison"},
# {:year=>"1984",
# :movie_url=>"/wiki/The_Buddy_System_(film)",
# :movie_title=>"The Buddy System",
# :role=>"Joe"},
# {:year=>"1986",
# :movie_url=>"/wiki/Down_and_Out_in_Beverly_Hills",
# :movie_title=>"Down and Out in Beverly Hills",
# :role=>"David 'Dave' Whiteman"},
# {:year=>"1986",
# :movie_url=>"/wiki/Stand_by_Me_(film)",
# :movie_title=>"Stand by Me",
# :role=>"Narrator/Gordie LaChance (adult)"},
# {:year=>"1987",
# :movie_url=>"/wiki/Tin_Men",
# :movie_title=>"Tin Men",
# :role=>"Bill 'BB' Babowsky"},
# {:year=>"1987",
# :movie_url=>"/wiki/Stakeout_(1987_film)",
# :movie_title=>"Stakeout",
# :role=>"Det. Chris Lecce"},
# {:year=>"1987",
# :movie_url=>"/wiki/Nuts_(film)",
# :movie_title=>"Nuts",
# :role=>"Aaron Levinsky"},
# {:year=>"1988",
# :movie_url=>"/wiki/Moon_Over_Parador",
# :movie_title=>"Moon Over Parador",
# :role=>"Jack Noah/President Alphonse Simms"},
# {:year=>"1989",
# :movie_url=>"/wiki/Let_It_Ride_(film)",
# :movie_title=>"Let It Ride",
# :role=>"Jay Trotter"},
# {:year=>"1989",
# :movie_url=>"/wiki/Always_(1989_film)",
# :movie_title=>"Always",
# :role=>"Pete Sandich"},
# {:year=>"1990",
# :movie_url=>"/wiki/Rosencrantz_%26_Guildenstern_Are_Dead_(film)",
# :movie_title=>"Rosencrantz & Guildenstern Are Dead",
# :role=>"The Player"},
# {:year=>"1990",
# :movie_url=>"/wiki/Postcards_from_the_Edge_(film)",
# :movie_title=>"Postcards from the Edge",
# :role=>"Doctor Frankenthal"},
# {:year=>"1991",
# :movie_url=>"/wiki/Once_Around",
# :movie_title=>"Once Around",
# :role=>"Sam Sharpe"},
# {:year=>"1991",
# :movie_url=>"/wiki/Prisoner_of_Honor",
# :movie_title=>"Prisoner of Honor",
# :role=>"Col. Picquart"},
# {:year=>"1991",
# :movie_url=>"/wiki/What_About_Bob%3F",
# :movie_title=>"What About Bob?",
# :role=>"Dr. Leo Marvin"},
# {:year=>"1993",
# :movie_url=>"/wiki/Lost_in_Yonkers_(film)",
# :movie_title=>"Lost in Yonkers",
# :role=>"Louie Kurnitz"},
# {:year=>"1993",
# :movie_url=>"/wiki/Another_Stakeout",
# :movie_title=>"Another Stakeout",
# :role=>"Detective Chris Lecce"},
# {:year=>"1994",
# :movie_url=>"/wiki/Silent_Fall",
# :movie_title=>"Silent Fall",
# :role=>"Dr. Jake Rainer"},
# {:year=>"1995",
# :movie_url=>
# "/w/index.php?title=The_Last_Word_(1995_film)&action=edit&redlink=1",
# :movie_title=>"The Last Word",
# :role=>"Larry"},
# {:year=>"1995",
# :movie_url=>"/wiki/The_American_President_(film)",
# :movie_title=>"The American President",
# :role=>"Senator Bob Rumson"},
# {:year=>"1995",
# :movie_url=>"/wiki/Mr._Holland%27s_Opus",
# :movie_title=>"Mr. Holland's Opus",
# :role=>"Glenn Holland"},
# {:year=>"1996",
# :movie_url=>"/wiki/James_and_the_Giant_Peach_(film)",
# :movie_title=>"James and the Giant Peach",
# :role=>"Centipede (voice)"},
# {:year=>"1996",
# :movie_url=>"/wiki/Mad_Dog_Time",
# :movie_title=>"Mad Dog Time",
# :role=>"Vic"},
# {:year=>"1997",
# :movie_url=>"/wiki/Night_Falls_on_Manhattan",
# :movie_title=>"Night Falls on Manhattan",
# :role=>"Sam Vigoda"},
# {:year=>"1997",
# :movie_url=>"/wiki/Oliver_Twist_(1997_film)",
# :movie_title=>"Oliver Twist",
# :role=>"Fagin"},
# {:year=>"1998",
# :movie_url=>"/wiki/Krippendorf%27s_Tribe",
# :movie_title=>"Krippendorf's Tribe",
# :role=>"Prof. James Krippendorf"},
# {:year=>"1999",
# :movie_url=>"/wiki/Lansky_(film)",
# :movie_title=>"Lansky",
# :role=>"Meyer Lansky"},
# {:year=>"2000",
# :movie_url=>"/wiki/The_Crew_(2000_film)",
# :movie_title=>"The Crew",
# :role=>"Bobby Bartellemeo/Narrator"},
# {:year=>"2000",
# :movie_url=>"/wiki/Fail_Safe_(2000_TV)",
# :movie_title=>"Fail Safe",
# :role=>"President of the United States"},
# {:year=>"2001",
# :movie_url=>"/wiki/The_Old_Man_Who_Read_Love_Stories",
# :movie_title=>"The Old Man Who Read Love Stories",
# :role=>"Antonio Bolivar"},
# {:year=>"2001",
# :movie_url=>"/wiki/Who_Is_Cletis_Tout%3F",
# :movie_title=>"Who Is Cletis Tout?",
# :role=>"Micah Donnelly"},
# {:year=>"2001",
# :movie_url=>"/wiki/The_Education_of_Max_Bickford",
# :movie_title=>"The Education of Max Bickford",
# :role=>"Max Bickford"},
# {:year=>"2001",
# :movie_url=>"/wiki/The_Day_Reagan_Was_Shot",
# :movie_title=>"The Day Reagan Was Shot",
# :role=>"Alexander Haig"},
# {:year=>"2003",
# :movie_url=>"/wiki/Coast_to_Coast_(TV_film)",
# :movie_title=>"Coast to Coast",
# :role=>"Barnaby Pierce"},
# {:year=>"2004",
# :movie_url=>"/wiki/Silver_City_(2004_film)",
# :movie_title=>"Silver City",
# :role=>"Chuck Raven"},
# {:year=>"2006",
# :movie_url=>"/wiki/Poseidon_(film)",
# :movie_title=>"Poseidon",
# :role=>"Richard Nelson"},
# {:year=>"2007",
# :movie_url=>"/wiki/Tin_Man_(TV_miniseries)",
# :movie_title=>"Tin Man",
# :role=>"Mystic Man"},
# {:year=>"2007",
# :movie_url=>"/wiki/Ocean_of_Fear",
# :movie_title=>"Ocean of Fear",
# :role=>"Narrator"},
# {:year=>"2008",
# :movie_url=>"/wiki/Signs_of_the_Time_(film)",
# :movie_title=>"Signs of the Time",
# :role=>"Narrator"},
# {:year=>"2008",
# :movie_url=>"/wiki/W._(film)",
# :movie_title=>"W.",
# :role=>"Dick Cheney"},
# {:year=>"2008",
# :movie_url=>"/w/index.php?title=America_Betrayed&action=edit&redlink=1",
# :movie_title=>"America Betrayed",
# :role=>"Narrator"},
# {:year=>"2009",
# :movie_url=>"/wiki/My_Life_in_Ruins",
# :movie_title=>"My Life in Ruins",
# :role=>"Irv"},
# {:year=>"2009",
# :movie_url=>"/wiki/Leaves_of_Grass_(film)",
# :movie_title=>"Leaves of Grass",
# :role=>"Pug Rothbaum"},
# {:year=>"2009",
# :movie_url=>"/wiki/The_Lightkeepers",
# :movie_title=>"The Lightkeepers",
# :role=>"Seth"},
# {:year=>"2010",
# :movie_url=>"/wiki/Piranha_3D",
# :movie_title=>"Piranha 3D",
# :role=>"Matthew Boyd"},
# {:year=>"2010",
# :movie_url=>"/wiki/Weeds_(TV_series)",
# :movie_title=>"Weeds",
# :role=>"Warren Schiff"},
# {:year=>"2010",
# :movie_url=>"/wiki/RED_(film)",
# :movie_title=>"RED",
# :role=>"Alexander Dunning"},
# {:year=>"2012",
# :movie_url=>"/wiki/Coma_(U.S._miniseries)",
# :movie_title=>"Coma",
# :role=>"Professor Hillside"},
# {:year=>"2013",
# :movie_url=>"/wiki/Very_Good_Girls",
# :movie_title=>"Very Good Girls",
# :role=>"Danny, Gerry's father"},
# {:year=>"2013",
# :movie_url=>"/wiki/Paranoia_(2013_film)",
# :movie_title=>"Paranoia",
# :role=>"Francis Cassidy"}]
To explain what it's doing:
The "Filmography" table is a good source for the information; it's organized logically, so writing code to walk through it is easy.
doc.at('#Filmography').parent.next_element
finds that table using the <h2> heading just above it, then backs up and looks in the next tag, which is the table itself.
table.search('tr')[1..-1] finds the <tr> rows inside the table, skips the first, then iterates (using map) over the remaining ones.
tds = tr.search('td') finds the cells in the row. From that point on it's a matter of peeling that NodeSet apart like an array, shifting off the elements I want. The rest of the code should be pretty obvious. Once the parts of interest are retrieved, they're bundled into a hash, which map returns as part of an array of hashes.
Why not try parsing out the filmography section of the Wikipedia article? It seems pretty standard across the few actors that I looked at, and it mentions whether or not an entry was a TV series, so you could filter those out easily.
<tr>
<td>1966</td>
<td><i>Gidget</i></td>
<td>Durf the Drag</td>
<td>TV series 1 episode</td>
</tr>
<tr>
<td>1967</td>
<td><i>Valley of the Dolls</i></td>
<td>Assistant stage manager</td>
<td>Uncredited</td>
</tr>
Looks like you could pull nodes similar to this from the code and save all the info to do what you want with it. The first node could be disregarded since "TV" appears multiple times in the different subnodes.
Hope this helps!
-Larry
Okay, so I tested some code based on your actual request and came up with the following:
require 'nokogiri'
require 'open-uri'
url = "http://en.wikipedia.org/wiki/Richard_Dreyfuss"
doc = Nokogiri::HTML(URI.open(url))
all_links = doc.search("//a[@href]")
all_links.each do |link|
p_text = link.ancestors("p").text
link_index = p_text.index(link.text)
unless link_index.nil?
search_back = link_index > 50 ? link_index - 50 : 0
p_text[search_back..link_index].downcase.include?("film") ? puts(link['href']) : nil
end
end
Output
#=>/wiki/American_Graffiti
/wiki/Jaws_(film)
/wiki/Close_Encounters_of_the_Third_Kind
/wiki/The_Graduate
/wiki/The_Apprenticeship_of_Duddy_Kravitz_(film)
/wiki/Down_And_Out_In_Beverly_Hills
/wiki/Stakeout_(1987_film)
/wiki/Stephen_King
/wiki/The_Body_(novella)
/wiki/Poseidon_(film)
#cite_note-27
/wiki/Jonathan_Tasini
This seems to satisfy the question you were asking but obviously needs to be modified to fit your needs.
Edit
I added your request to look back only 50 characters within the paragraph. The output is much shorter now, but I am not sure the results will be as useful as you'd like. This answers the question but does not capture exactly what you are hoping for; e.g., the last two links are not to films, but they are within 50 characters of the word "film".
I've installed Sphinx on my server to try it out. I set up a simple search source, ran indexer once, and it worked well. Then I stopped the Sphinx process, and it has not been running for a few weeks now:
user@server ~ $ ps aux | grep sphinx
user 5919 0.0 0.0 13584 920 pts/5 S+ 12:07 0:00 grep --colour=auto sphinx
user@server ~ $ ps aux | grep index
user 5921 0.0 0.0 13584 916 pts/5 S+ 12:07 0:00 grep --colour=auto index
user@server ~ $ ps aux | grep search
user 5925 0.0 0.0 13584 916 pts/5 S+ 12:07 0:00 grep --colour=auto search
But yesterday I noticed unusually high memory usage on my MySQL database server. In show processlist; I saw a query that I had configured in the Sphinx sources: SELECT id, content FROM articles.
Why is this happening if Sphinx is stopped? How do I stop Sphinx from executing the queries?
My sphinx.conf:
#
# Sphinx configuration file sample
#
# WARNING! While this sample file mentions all available options,
# it contains (very) short helper descriptions only. Please refer to
# doc/sphinx.html for details.
#
#############################################################################
## data source definition
#############################################################################
source src1
{
# data source type. mandatory, no default value
# known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
type = mysql
#####################################################################
## SQL settings (for 'mysql' and 'pgsql' types)
#####################################################################
# some straightforward parameters for SQL source types
sql_host = 85.254.49.181
sql_user = root
sql_pass = #########
sql_db = articles_db
sql_port = 3306 # optional, default is 3306
# UNIX socket name
# optional, default is empty (reuse client library defaults)
# usually '/var/lib/mysql/mysql.sock' on Linux
# usually '/tmp/mysql.sock' on FreeBSD
#
# sql_sock = /tmp/mysql.sock
# MySQL specific client connection flags
# optional, default is 0
#
# mysql_connect_flags = 32 # enable compression
# MySQL specific SSL certificate settings
# optional, defaults are empty
#
# mysql_ssl_cert = /etc/ssl/client-cert.pem
# mysql_ssl_key = /etc/ssl/client-key.pem
# mysql_ssl_ca = /etc/ssl/cacert.pem
# MS SQL specific Windows authentication mode flag
# MUST be in sync with charset_type index-level setting
# optional, default is 0
#
# mssql_winauth = 1 # use currently logged on user credentials
# MS SQL specific Unicode indexing flag
# optional, default is 0 (request SBCS data)
#
# mssql_unicode = 1 # request Unicode data from server
# ODBC specific DSN (data source name)
# mandatory for odbc source type, no default value
#
# odbc_dsn = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
# sql_query = SELECT id, data FROM documents.csv
# pre-query, executed before the main fetch query
# multi-value, optional, default is empty list of queries
#
sql_query_pre = SET NAMES utf8
# sql_query_pre = SET SESSION query_cache_type=OFF
# main document fetch query
# mandatory, integer document ID field MUST be the first selected column
sql_query = \
SELECT id, content \
FROM articles
# range query setup, query that must return min and max ID values
# optional, default is empty
#
# sql_query will need to reference $start and $end boundaries
# if using ranged query:
#
# sql_query = \
# SELECT doc.id, doc.id AS group, doc.title, doc.data \
# FROM documents doc \
# WHERE id>=$start AND id<=$end
#
# sql_query_range = SELECT MIN(id),MAX(id) FROM documents
# range query step
# optional, default is 1024
#
# sql_range_step = 1000
# unsigned integer attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# optional bit size can be specified, default is 32
#
# sql_attr_uint = author_id
# sql_attr_uint = forum_id:9 # 9 bits for forum_id
#sql_attr_uint = id
# boolean attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# equivalent to sql_attr_uint with 1-bit size
#
# sql_attr_bool = is_deleted
# bigint attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# declares a signed (unlike uint!) 64-bit attribute
#
# sql_attr_bigint = my_bigint_id
# UNIX timestamp attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# similar to integer, but can also be used in date functions
#
# sql_attr_timestamp = posted_ts
# sql_attr_timestamp = last_edited_ts
#sql_attr_timestamp = date_added
# string ordinal attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# sorts strings (bytewise), and stores their indexes in the sorted list
# sorting by this attr is equivalent to sorting by the original strings
#
# sql_attr_str2ordinal = author_name
# floating point attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# values are stored in single precision, 32-bit IEEE 754 format
#
# sql_attr_float = lat_radians
# sql_attr_float = long_radians
# multi-valued attribute (MVA) attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# MVA values are variable length lists of unsigned 32-bit integers
#
# syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
# ATTR-TYPE is 'uint' or 'timestamp'
# SOURCE-TYPE is 'field', 'query', or 'ranged-query'
# QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
# RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
#
# sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
# sql_attr_multi = uint tag from ranged-query; \
# SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
# SELECT MIN(id), MAX(id) FROM tags
# post-query, executed on sql_query completion
# optional, default is empty
#
# sql_query_post =
# post-index-query, executed on successful indexing completion
# optional, default is empty
# $maxid expands to max document ID actually fetched from DB
#
# sql_query_post_index = REPLACE INTO counters ( id, val ) \
# VALUES ( 'max_indexed_id', $maxid )
# ranged query throttling, in milliseconds
# optional, default is 0 which means no delay
# enforces given delay before each query step
sql_ranged_throttle = 0
# document info query, ONLY for CLI search (ie. testing and debugging)
# optional, default is empty
# must contain $id macro and must fetch the document by that id
sql_query_info = SELECT * FROM articles WHERE id=$id
# kill-list query, fetches the document IDs for kill-list
# k-list will suppress matches from preceding indexes in the same query
# optional, default is empty
#
# sql_query_killlist = SELECT id FROM documents WHERE edited>=@last_reindex
# columns to unpack on indexer side when indexing
# multi-value, optional, default is empty list
#
# unpack_zlib = zlib_column
# unpack_mysqlcompress = compressed_column
# unpack_mysqlcompress = compressed_column_2
# maximum unpacked length allowed in MySQL COMPRESS() unpacker
# optional, default is 16M
#
# unpack_mysqlcompress_maxsize = 16M
#####################################################################
## xmlpipe settings
#####################################################################
# type = xmlpipe
# shell command to invoke xmlpipe stream producer
# mandatory
#
# xmlpipe_command = cat /var/lib/sphinxsearch/test.xml
#####################################################################
## xmlpipe2 settings
#####################################################################
# type = xmlpipe2
# xmlpipe_command = cat /var/lib/sphinxsearch/test2.xml
# xmlpipe2 field declaration
# multi-value, optional, default is empty
#
# xmlpipe_field = subject
# xmlpipe_field = content
# xmlpipe2 attribute declaration
# multi-value, optional, default is empty
# all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
#
# xmlpipe_attr_timestamp = published
# xmlpipe_attr_uint = author_id
# perform UTF-8 validation, and filter out incorrect codes
# avoids XML parser choking on non-UTF-8 documents
# optional, default is 0
#
# xmlpipe_fixup_utf8 = 1
}
# inherited source example
#
# all the parameters are copied from the parent source,
# and may then be overridden in this source definition
source src1throttled : src1
{
sql_ranged_throttle = 100
}
#############################################################################
## index definition
#############################################################################
# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index articles
{
# document source(s) to index
# multi-value, mandatory
# document IDs must be globally unique across all sources
source = src1
# index files path and file name, without extension
# mandatory, path must be writable, extensions will be auto-appended
path = /var/lib/sphinxsearch/data/parts
# document attribute values (docinfo) storage mode
# optional, default is 'extern'
# known values are 'none', 'extern' and 'inline'
docinfo = extern
# memory locking for cached data (.spa and .spi), to prevent swapping
# optional, default is 0 (do not mlock)
# requires searchd to be run from root
mlock = 0
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = stem_ru
# minimum word length at which to enable stemming
# optional, default is 1 (stem everything)
#
# min_stemming_len = 1
# stopword files list (space separated)
# optional, default is empty
# contents are plain text, charset_table and stemming are both applied
#
# stopwords = /var/lib/sphinxsearch/data/stopwords.txt
# wordforms file, in "mapfrom > mapto" plain text format
# optional, default is empty
#
# wordforms = /var/lib/sphinxsearch/data/wordforms.txt
# tokenizing exceptions file
# optional, default is empty
#
# plain text, case sensitive, space insensitive in map-from part
# one "Map Several Words => ToASingleOne" entry per line
#
# exceptions = /var/lib/sphinxsearch/data/exceptions.txt
# minimum indexed word length
# default is 1 (index everything)
min_word_len = 1
# charset encoding type
# optional, default is 'sbcs'
# known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
charset_type = utf-8
# charset definition and case folding rules "table"
# optional, default value depends on charset_type
#
# defaults are configured to include English and Russian characters only
# you need to change the table to include additional ones
# this behavior MAY change in future versions
#
# 'sbcs' default value is
# charset_table = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
#
# 'utf-8' default value is
# charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
# ignored characters list
# optional, default value is empty
#
# ignore_chars = U+00AD
# minimum word prefix length to index
# optional, default is 0 (do not index prefixes)
#
# min_prefix_len = 0
# minimum word infix length to index
# optional, default is 0 (do not index infixes)
#
# min_infix_len = 0
# list of fields to limit prefix/infix indexing to
# optional, default value is empty (index all fields in prefix/infix mode)
#
# prefix_fields = filename
# infix_fields = url, domain
# enable star-syntax (wildcards) when searching prefix/infix indexes
# known values are 0 and 1
# optional, default is 0 (do not use wildcard syntax)
#
# enable_star = 1
# n-gram length to index, for CJK indexing
# only supports 0 and 1 for now, other lengths to be implemented
# optional, default is 0 (disable n-grams)
#
# ngram_len = 1
# n-gram characters list, for CJK indexing
# optional, default is empty
#
# ngram_chars = U+3000..U+2FA1F
# phrase boundary characters list
# optional, default is empty
#
# phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
# phrase boundary word position increment
# optional, default is 0
#
# phrase_boundary_step = 100
# whether to strip HTML tags from incoming documents
# known values are 0 (do not strip) and 1 (do strip)
# optional, default is 0
html_strip = 0
# what HTML attributes to index if stripping HTML
# optional, default is empty (do not index anything)
#
# html_index_attrs = img=alt,title; a=title;
# what HTML elements contents to strip
# optional, default is empty (do not strip element contents)
#
# html_remove_elements = style, script
# whether to preopen index data files on startup
# optional, default is 0 (do not preopen), searchd-only
#
# preopen = 1
# whether to keep dictionary (.spi) on disk, or cache it in RAM
# optional, default is 0 (cache in RAM), searchd-only
#
# ondisk_dict = 1
# whether to enable in-place inversion (2x less disk, 90-95% speed)
# optional, default is 0 (use separate temporary files), indexer-only
#
# inplace_enable = 1
# in-place fine-tuning options
# optional, defaults are listed below
#
# inplace_hit_gap = 0 # preallocated hitlist gap size
# inplace_docinfo_gap = 0 # preallocated docinfo gap size
# inplace_reloc_factor = 0.1 # relocation buffer size within arena
# inplace_write_factor = 0.1 # write buffer size within arena
# whether to index original keywords along with stemmed versions
# enables "=exactform" operator to work
# optional, default is 0
#
# index_exact_words = 1
# position increment on overshort (less that min_word_len) words
# optional, allowed values are 0 and 1, default is 1
#
# overshort_step = 1
# position increment on stopword
# optional, allowed values are 0 and 1, default is 1
#
# stopword_step = 1
}
# inherited index example
#
# all the parameters are copied from the parent index,
# and may then be overridden in this index definition
#index test1stemmed : test1
#{
# path = /var/lib/sphinxsearch/data/test1stemmed
# morphology = stem_en
#}
# distributed index example
#
# this is a virtual index which can NOT be directly indexed,
# and only contains references to other local and/or remote indexes
############################################################################
## indexer settings
#############################################################################
indexer
{
# memory limit, in bytes, kiloytes (16384K) or megabytes (256M)
# optional, default is 32M, max is 2047M, recommended is 256M to 1024M
mem_limit = 32M
# maximum IO calls per second (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iops = 40
# maximum IO call size, bytes (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iosize = 1048576
# maximum xmlpipe2 field length, bytes
# optional, default is 2M
#
# max_xmlpipe2_field = 4M
# write buffer size, bytes
# several (currently up to 4) buffers will be allocated
# write buffers are allocated in addition to mem_limit
# optional, default is 1M
#
# write_buffer = 1M
}
#############################################################################
## searchd settings
#############################################################################
searchd
{
# hostname, port, or hostname:port, or /unix/socket/path to listen on
# multi-value, multiple listen points are allowed
# optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
#
#listen = localhost:9312
#listen = 0.0.0.0:9306:mysql41
# listen = 192.168.0.1:9312
# listen = 9312
# listen = /var/run/searchd.sock
listen = 0.0.0.0:9306:mysql41
# log file, searchd run info is logged here
# optional, default is 'searchd.log'
log = /var/log/sphinxsearch/searchd.log
# query log file, all search queries are logged here
# optional, default is empty (do not log queries)
query_log = /var/log/sphinxsearch/query.log
# client read timeout, seconds
# optional, default is 5
read_timeout = 5
# request timeout, seconds
# optional, default is 5 minutes
client_timeout = 300
# maximum amount of children to fork (concurrent searches to run)
# optional, default is 0 (unlimited)
max_children = 30
# PID file, searchd process ID file name
# mandatory
pid_file = /var/run/sphinxsearch/searchd.pid
# max amount of matches the daemon ever keeps in RAM, per-index
# WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
# default is 1000 (just like Google)
max_matches = 1000
# seamless rotate, prevents rotate stalls if precaching huge datasets
# optional, default is 1
seamless_rotate = 1
# whether to forcibly preopen all indexes on startup
# optional, default is 0 (do not preopen)
preopen_indexes = 0
# whether to unlink .old index copies on succesful rotation.
# optional, default is 1 (do unlink)
unlink_old = 1
# attribute updates periodic flush timeout, seconds
# updates will be automatically dumped to disk this frequently
# optional, default is 0 (disable periodic flush)
#
# attr_flush_period = 900
# instance-wide ondisk_dict defaults (per-index value take precedence)
# optional, default is 0 (precache all dictionaries in RAM)
#
# ondisk_dict_default = 1
# MVA updates pool size
# shared between all instances of searchd, disables attr flushes!
# optional, default size is 1M
mva_updates_pool = 1M
# max allowed network packet size
# limits both query packets from clients, and responses from agents
# optional, default size is 8M
max_packet_size = 8M
# crash log path
# searchd will (try to) log crashed query to 'crash_log_path.PID' file
# optional, default is empty (do not create crash logs)
#
# crash_log_path = /var/log/sphinxsearch/crash
# max allowed per-query filter count
# optional, default is 256
max_filters = 256
# max allowed per-filter values count
# optional, default is 4096
max_filter_values = 4096
# socket listen queue length
# optional, default is 5
#
# listen_backlog = 5
# per-keyword read buffer size
# optional, default is 256K
#
# read_buffer = 256K
# unhinted read size (currently used when reading hits)
# optional, default is 32K
#
# read_unhinted = 32K
}
# --eof--
Sphinx itself, which is generally considered to be searchd (the daemon), doesn't ever run those queries.
indexer is the tool that actually runs the queries, and it has no facility to run automatically; i.e., it only runs when something invokes it.
Are you sure you didn't add it to cron/crontab, even just for testing?
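A quick way to check is to scan the crontab for active lines that mention indexer. The sketch below does that in Ruby; the helper name `indexer_cron_lines` and the sample crontab contents are hypothetical, for illustration only.

```ruby
# Return the crontab lines that are active (not commented out) and mention
# the indexer binary.
def indexer_cron_lines(crontab_text)
  crontab_text.each_line.map(&:strip).select do |line|
    !line.empty? && !line.start_with?('#') && line.match?(/\bindexer\b/)
  end
end

# Hypothetical crontab contents, for illustration only.
sample = <<~CRON
  # m h dom mon dow command
  0 3 * * * /usr/bin/indexer --all --rotate
  */5 * * * * /usr/local/bin/backup.sh
  # 0 4 * * * /usr/bin/indexer articles
CRON

hits = indexer_cron_lines(sample)
puts hits
```

On the server itself, the same helper could be fed the output of `crontab -l` for each user, plus the contents of /etc/crontab and any files under /etc/cron.d.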