Natural Language Search

  • Published on
    20-Mar-2016

Transcript

  • academy.devbg.org, www.devbg.org

  • inverted index; full-text search in MySQL, Microsoft SQL Server, and Oracle

  • Substring search

    Example: "London police release photos of terror suspects"

    A search for "lease" also matches inside "release", because it is a substring.

  • : mouse, mouse, mice... : C++

  • , string

  • effectiveness: precision and recall; efficiency
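
The slide names precision and recall, but their definitions did not survive in this transcript. The standard definitions are sketched below; the symbols R (the set of relevant documents) and A (the set of documents returned for a query) are assumptions introduced here, not taken from the slides.

```latex
\[
\text{precision} = \frac{|R \cap A|}{|A|},
\qquad
\text{recall} = \frac{|R \cap A|}{|R|}
\]
```

For example, if a query returns 10 documents of which 4 are relevant, and the collection holds 8 relevant documents in total, precision is 4/10 and recall is 4/8.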

  • inverted index: , , ,

    (1); (3);(1); (2); (4);(2); (4);(3);

  • inverted index, types:

    Inverted file index: for each term, the documents that contain it

    Full inverted index: for each term, the documents and the positions within them

  • inverted index, posting formats:

    t1 -> (d1), (d2), ... : documents only

    t1 -> (d1, tf1), (d2, tf2), ... : documents with term frequencies

    t1 -> (d1, p1), (d2, p2), ... : documents with positions
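
The three posting formats can all be read off a single positional index. The following is a minimal in-memory sketch, not the presenter's code: the class `PositionalIndex`, its method names, and the simple `\W+` tokenizer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class: term -> (document id -> positions).
class PositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Tokenizes the text on non-word characters and records each token position.
    void add(int docId, String text) {
        int pos = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
            pos++;
        }
    }

    private Map<Integer, List<Integer>> postingsOf(String term) {
        Map<Integer, List<Integer>> p = postings.get(term);
        return p == null ? Collections.<Integer, List<Integer>>emptyMap() : p;
    }

    // Format 1: t -> (d1), (d2), ...
    List<Integer> documents(String term) {
        return new ArrayList<Integer>(postingsOf(term).keySet());
    }

    // Format 2: t -> (d, tf)
    int termFrequency(String term, int docId) {
        return positions(term, docId).size();
    }

    // Format 3: t -> (d, p1, p2, ...)
    List<Integer> positions(String term, int docId) {
        List<Integer> p = postingsOf(term).get(docId);
        return p == null ? new ArrayList<Integer>() : p;
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.add(1, "London police release photos");
        index.add(2, "police photos of police");
        System.out.println(index.documents("police"));
        System.out.println(index.termFrequency("police", 2));
        System.out.println(index.positions("police", 2));
    }
}
```

Storing positions costs the most space, but it is the only one of the three formats that can answer phrase or proximity queries.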

  • numSubSet = 1
    For each document d in the collection
    Begin
      While memory exists:
        For each term t in document d
          Find term t in the term dictionary
          If term t exists, add a node to its posting list
          Otherwise, add term t to the term dictionary
      Write SubSet of Inverted index to disk
      numSubSet = numSubSet + 1
      Free memory
    End
    For I = 1 to numSubSet - 1
      Merge SubSet I with Inverted Index
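
The build-then-merge idea in the pseudocode can be sketched as follows. This is a simplified illustration under stated assumptions: a fixed number of documents per block stands in for "while memory exists", the partial indexes are kept in memory instead of being written to disk, and the names `BlockIndexer`, `buildBlocks`, and `merge` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class for block-wise index construction.
class BlockIndexer {
    // Builds one partial index (term -> sorted document ids) per block of documents.
    // Document ids are 1-based positions in the input list.
    static List<Map<String, List<Integer>>> buildBlocks(List<String> docs, int maxDocsPerBlock) {
        List<Map<String, List<Integer>>> blocks = new ArrayList<>();
        Map<String, List<Integer>> current = new TreeMap<>();
        int docsInBlock = 0;
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1;
            for (String term : docs.get(i).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                List<Integer> list = current.computeIfAbsent(term, k -> new ArrayList<>());
                // Record each document at most once per term.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) list.add(docId);
            }
            docsInBlock++;
            if (docsInBlock == maxDocsPerBlock) {   // "memory" is full: flush the block
                blocks.add(current);
                current = new TreeMap<>();
                docsInBlock = 0;
            }
        }
        if (!current.isEmpty()) blocks.add(current);
        return blocks;
    }

    // Merges the blocks in order; ids grow from block to block, so lists stay sorted.
    static Map<String, List<Integer>> merge(List<Map<String, List<Integer>>> blocks) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map<String, List<Integer>> block : blocks) {
            for (Map.Entry<String, List<Integer>> e : block.entrySet()) {
                index.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("police release photos", "police suspects", "photos of suspects");
        System.out.println(merge(buildBlocks(docs, 2)));
    }
}
```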

  • inverted index,

    + + -

  • inverted index,

    B-tree: + - -

  • inverted index,

    + + - -

  • inverted index,

    Hashtable: + ; - (wildcard search)
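
The slide lists wildcard search as the drawback of a hashtable dictionary. A small sketch of why that is plausible: an ordered dictionary answers a prefix query such as cons* with one range scan, while a hash table has to examine every key. The class `DictionaryLookup` and its methods are hypothetical, and Java's `TreeMap` is used here only as a stand-in for an ordered structure such as a B-tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class comparing two dictionary layouts.
class DictionaryLookup {
    // Ordered dictionary: all keys k with prefix <= k < prefix + '\uffff'.
    // Assumes no term contains the character '\uffff'.
    static List<String> prefixSorted(TreeMap<String, Integer> dict, String prefix) {
        return new ArrayList<String>(dict.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    // Hash table: no ordering, so every key is tested against the prefix.
    static List<String> prefixHashed(HashMap<String, Integer> dict, String prefix) {
        List<String> out = new ArrayList<String>();
        for (String term : dict.keySet()) {
            if (term.startsWith(prefix)) out.add(term);
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> sorted = new TreeMap<String, Integer>();
        for (String t : new String[] {"consider", "conservation", "mouse", "mice"}) {
            sorted.put(t, 1);
        }
        System.out.println(prefixSorted(sorted, "cons"));
        System.out.println(prefixHashed(new HashMap<String, Integer>(sorted), "cons"));
    }
}
```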

  • inverted index

  • inverted index

    ,

  • Datastore -> Filter -> Sectioner -> Lexer -> Indexing engine

  • inverted index

  • inverted index : preprocess

  • inverted index: preprocess

    Preprocessing can reduce the size of the inverted index by 50%.

    The word "the" alone accounts for 7%!

  • inverted index

    consider; considerable; considerably; considerate; considerateness; consideration; considering; conservancy; conservation; conservationist; conservatism; conservative

    consider [4]able [4]ably [3]ate [7]ateness [5]ation [3]ing

    conserva [3]ncy [4]tion [7]tionist [4]tism [4]tive
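
The compressed form above stores a shared prefix once, followed by each remaining suffix tagged with its length. A minimal sketch of that encoding follows; the class `PrefixGroupCoding` and the method `encode` are hypothetical, and the choice of prefix for each group is left to the caller. A word equal to the prefix is assumed to be represented by the prefix itself, as "consider" is on the slide.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class for prefix-group dictionary compression.
class PrefixGroupCoding {
    // Writes the prefix once, then "[suffixLength]suffix" for every word.
    static String encode(String prefix, List<String> words) {
        StringBuilder out = new StringBuilder(prefix);
        for (String word : words) {
            if (!word.startsWith(prefix)) {
                throw new IllegalArgumentException(word + " does not start with " + prefix);
            }
            String suffix = word.substring(prefix.length());
            if (!suffix.isEmpty()) {
                out.append('[').append(suffix.length()).append(']').append(suffix);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("consider", "considerable", "considerably",
                "considerate", "considerateness", "consideration", "considering");
        System.out.println(encode("consider", words));
    }
}
```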

  • inverted index

    Store D(i) - D(i-1) instead of D(i):

    ... 5600; 5679; 5684; 5685; 5780 ... -> ... 79; 5; 1; 95 ...

    10-15%
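
The gap idea on this slide can be sketched in a few lines. This is an illustration only: the class `GapCoding` and its methods are hypothetical, and the first identifier is kept as a gap from 0 so that the list can be decoded without extra state.

```java
import java.util.Arrays;

// Hypothetical illustration class for gap (delta) encoding of a posting list.
class GapCoding {
    // Replaces each sorted document id with its distance from the previous one.
    static int[] toGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int previous = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - previous;
            previous = docIds[i];
        }
        return gaps;
    }

    // Restores the original ids with a running sum over the gaps.
    static int[] fromGaps(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            sum += gaps[i];
            docIds[i] = sum;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] ids = {5600, 5679, 5684, 5685, 5780};
        System.out.println(Arrays.toString(toGaps(ids)));
        System.out.println(Arrays.toString(fromGaps(toGaps(ids))));
    }
}
```

The gaps are small numbers, so a variable-length integer encoding can store most of them in fewer bytes than the full identifiers would need.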

  • inverted index :

  • inverted index 1, 2, ... N - () 1 100% 2 80 % 3 60%

  • MySQL, Microsoft SQL Server, Oracle

  • MySQL 4.1

    MATCH (col1,col2,...) AGAINST (expr [IN BOOLEAN MODE | WITH QUERY EXPANSION])

    col1, col2 ... colN: columns of type CHAR, VARCHAR, or TEXT in a MyISAM table

    expr: in boolean mode may use the operators + - ( ) ~ *
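
To make the + and - operators concrete, here is a deliberately simplified model of boolean-mode matching. It is not MySQL's implementation: it ignores the ( ) ~ * operators, stop words, minimum word length, and scoring, and the class `BooleanModeMatcher` is hypothetical. It assumes the usual reading that +word must be present, -word must be absent, and a query without any + term needs at least one of its plain words to match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class: a toy model of "+required -excluded optional" queries.
class BooleanModeMatcher {
    static boolean matches(String query, String document) {
        Set<String> words = new HashSet<String>(Arrays.asList(document.toLowerCase().split("\\W+")));
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String token : query.trim().toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (token.startsWith("+")) {
                hasRequired = true;
                if (!words.contains(token.substring(1))) return false;  // required word missing
            } else if (token.startsWith("-")) {
                if (words.contains(token.substring(1))) return false;   // excluded word present
            } else if (words.contains(token)) {
                optionalHit = true;
            }
        }
        // Reaching this point means every + word was found and no - word was found.
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        System.out.println(matches("+police -terror", "London police release photos"));
        System.out.println(matches("+police -terror", "police release photos of terror suspects"));
    }
}
```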

  • MySQL 4.1

    Boolean search mode: score

    Score 1 (score)

  • MySQL 4.1

    MySQL: test != tests; C++ - 3; fulltext (Lexer) (Filter)

  • MySQL 4.1

    CREATE TABLE t_text (id INTEGER PRIMARY KEY, title TEXT, text TEXT);

    ALTER TABLE t_text ADD FULLTEXT (title, text);

    REPAIR TABLE t_text QUICK;

  • MySQL 4.1

    SELECT id, MATCH (col1[,col2]) AGAINST ('exp1') AS score
    FROM articles
    WHERE MATCH (col1[,col2]) AGAINST ('exp1');

  • Microsoft SQL Server 2000

    Microsoft SQL Server 2000 Fulltext Service

    CONTAINS(column, expr)

    CONTAINSTABLE(table, column, expr [, N])

    FREETEXT(column, expr)

    FREETEXTTABLE(table, column, expr [, N])

    column may be * (all full-text indexed columns)

  • Microsoft SQL Server 2000 expr CONTAINS CONTAINSTABLE

  • Microsoft SQL Server 2000

    Microsoft SQL Server 2000: fulltext; C++; fulltext backup; (Lexer); (Filter)

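  • Microsoft SQL Server 2000, fulltext stored procedures:

    sp_fulltext_catalog (creates, drops, and rebuilds a full-text catalog)

    sp_fulltext_database (enables or disables fulltext for the current database)

    sp_fulltext_service (sets properties of the full-text service)

    sp_fulltext_table (marks a table for full-text indexing)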

  • Microsoft SQL Server 2000

    SELECT [KEY], [RANK] FROM CONTAINSTABLE(t_tree, text, expr, 100)

    SELECT id FROM t_tree WHERE CONTAINS(text, expr)

  • Oracle

    Oracle Database 10.1.0.2

    SCORE(id)

    CONTAINS(column, expr[, id]) > 0

    CATSEARCH(column, expr, structured query) > 0

    MATCHES(column, text)

    expr: AND, OR, NOT, wildcards % and _, NEAR, ABOUT

    structured query - text

  • Oracle

    Oracle Text:

    Datastore (sources include URL)

    Filter

    Lexer

  • Oracle

    CREATE TABLE t_text (id NUMBER PRIMARY KEY, title VARCHAR2(4000), text CLOB);

    CREATE INDEX i_title ON t_text(title) INDEXTYPE IS CTXSYS.CONTEXT;

    SELECT id, SCORE(1) FROM t_text WHERE CONTAINS(title, 'expr', 1) > 0;

  • Jakarta Lucene

    API: Java, .NET

    Lexer and Filter (Analyzer)

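  • IndexWriter: builds the index

    Analyzer: plays the role of Lexer and Filter

    Document: a set of name/value fields

    Query: a parsed search expression

    QueryParser: builds a Query from a search expression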

  • Indexing:

    // true: create a new index; myAnalyzer is the Analyzer used to tokenize the text
    IndexWriter index = new IndexWriter(myFile, myAnalyzer, true);
    Document doc = new Document();
    doc.add(Field.Text(field1, value1));
    doc.add(Field.Text(dbid, id, false));
    index.addDocument(doc);

  • Searching:

    Searcher searcher = new IndexSearcher(myFile);
    Query myQuery = QueryParser.parse(expr, field, myAnalyzer);
    Hits hits = searcher.search(myQuery);
    hits.doc(0).get(dbid);

  • Questions?