Some problems when searching with MMAnalyzer
zhanjianhua
2008-07-11
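I recently started learning Lucene and found that, after tokenizing with MMAnalyzer, many English words can no longer be found. I don't know whether these are the so-called stop words. If they are, what should I do so that the analyzer keeps those words when tokenizing? My code is below; please take a look and tell me whether there is a way to get results. Changing new MMAnalyzer() to new SimpleAnalyzer() does make the search work, but is there any other way?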
package ch2.lucenedemo.test;

import java.io.IOException;

import jeasy.analysis.MMAnalyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;

public class SearchTimeCompareTest {

    public static void main(String[] args) {
        SearchTimeCompareTest st = new SearchTimeCompareTest();
        st.getSearch();
    }

    public void getSearch() {
        RAMDirectory ram = new RAMDirectory();
        IndexWriter writer;
        try {
            // The index is built with SimpleAnalyzer.
            writer = new IndexWriter(ram, new SimpleAnalyzer(), true);
            Document doc1 = new Document();
            doc1.add(new Field("content", "DIDO Thank you ", Field.Store.YES, Field.Index.TOKENIZED));
            Document doc2 = new Document();
            doc2.add(new Field("content", "HERE and NOW 恭硕良", Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc1);
            writer.addDocument(doc2);
            writer.flush();
            writer.close();

            // The query text is parsed with MMAnalyzer.
            IndexSearcher searcher = new IndexSearcher(ram);
            BooleanQuery bq = new BooleanQuery();
            QueryParser parser = new QueryParser("content", new MMAnalyzer());
            Query query = parser.parse("NOW");
            bq.add(query, BooleanClause.Occur.SHOULD);
            System.out.println(bq.toString());

            // Print the hit count, then every matching document.
            Hits hits = searcher.search(bq);
            System.out.println(hits.length());
            if (hits.length() > 0) {
                for (int j = 0; j < hits.length(); j++) {
                    Document doc = hits.doc(j);
                    System.out.println(" " + hits.length() + " " + doc.get("content"));
                }
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
Why does this print "()" and "0" when I run it? Is it because stop words were removed?
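To make the stop-word guess concrete, here is a minimal plain-Java sketch of how a stop-word filter could leave nothing behind for a one-word query. It does not use Lucene or MMAnalyzer. The class StopWordDemo, the method analyze, and the stop list are all hypothetical and are not MMAnalyzer's real list or behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene or MMAnalyzer.
class StopWordDemo {
    // Assumed stop list for the demo; NOT MMAnalyzer's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "here", "now", "you"));

    // Lowercases the text, splits it on whitespace, and drops stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String word : text.trim().split("\\s+")) {
            String term = word.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !stopWords.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With the assumed list, the query "NOW" has no terms left.
        System.out.println(analyze("NOW", STOP_WORDS));
        // With an empty stop list, the term survives.
        System.out.println(analyze("NOW", new HashSet<String>()));
    }
}
```

If the analyzer used at query time behaves like this, the parsed query contains no terms, which would be consistent with the empty "()" and the 0 hits.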
zhanjianhua
2008-07-21
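Maybe I didn't state my question clearly. When I use an analyzer that I found online, it filters out many English and Chinese words. Lucene's built-in StandardAnalyzer provides a constructor for this, Analyzer analyzer = new StandardAnalyzer(stopWord), but the MMAnalyzer I am using doesn't seem to provide one. Alternatively, is there a Chinese analyzer that works like this:
String str = "中文分词器"; split the string above into: 中,文,分,词,器,中文,文分,分词,词器,中文分,文分词,分词器,中文分词,文分词器,中文分词器. Can an analyzer like this be found online? If so, please reply with one.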
chester60
2008-07-22
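I don't know of one that splits that finely. That kind of splitting is simple, though; you can write one yourself.
Something similar to this approach is "庖丁解牛" (Paoding); search for it on JavaEye and you will find it.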
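As a rough idea of what "write one yourself" could look like, here is a minimal plain-Java sketch that lists every contiguous substring of a string, shortest first. The class SubstringTokens and the method allSubstrings are hypothetical names, not part of Lucene, MMAnalyzer, or Paoding. The sketch works on Java char units, so it assumes the text has no characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Lucene, MMAnalyzer, or Paoding.
class SubstringTokens {
    // Returns every contiguous substring of text, ordered by length
    // (shortest first) and left to right within each length.
    static List<String> allSubstrings(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int len = 1; len <= text.length(); len++) {
            for (int start = 0; start + len <= text.length(); start++) {
                tokens.add(text.substring(start, start + len));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the 15 substrings of the five-character example string.
        System.out.println(allSubstrings("中文分词器"));
    }
}
```

To use this inside Lucene, the method would still need to be wrapped in a custom Tokenizer and Analyzer, which this sketch does not show. A string of n characters produces n(n+1)/2 tokens, so the index grows quickly with longer text.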

