Integrating the IKAnalyzer Chinese Tokenizer with Lucene 5.x

  •  March 25, 2019
  •  Java Lucene IKAnalyzer 

Reposted from https://blog.csdn.net/isea533/article/details/50186963

IKAnalyzer implements the 4.x version of the Analyzer API, which is not compatible with Lucene 5.x. To use IKAnalyzer with 5.x, we therefore need to implement the 5.x API ourselves. Reading the source shows that two classes need to be adapted: the Tokenizer and the Analyzer.

Adapting the Tokenizer

We write a class IKTokenizer5x that extends Tokenizer:

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeFactory;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import java.io.IOException;

/**
 * ik-analyzer support for Lucene 5.x
 */
public class IKTokenizer5x extends Tokenizer {
    private IKSegmenter _IKImplement;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private int endPosition;

    public IKTokenizer5x() {
        this._IKImplement = new IKSegmenter(this.input, true);
    }

    public IKTokenizer5x(boolean useSmart) {
        this._IKImplement = new IKSegmenter(this.input, useSmart);
    }

    public IKTokenizer5x(AttributeFactory factory) {
        super(factory);
        this._IKImplement = new IKSegmenter(this.input, true);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        this.clearAttributes();
        Lexeme nextLexeme = this._IKImplement.next();
        if (nextLexeme != null) {
            this.termAtt.append(nextLexeme.getLexemeText());
            this.termAtt.setLength(nextLexeme.getLength());
            this.offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            this.endPosition = nextLexeme.getEndPosition();
            this.typeAtt.setType(nextLexeme.getLexemeTypeString());
            return true;
        } else {
            return false;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        this._IKImplement.reset(this.input);
    }

    @Override
    public final void end() throws IOException {
        super.end();
        int finalOffset = this.correctOffset(this.endPosition);
        this.offsetAtt.setOffset(finalOffset, finalOffset);
    }

}

This class is only a light modification of IKTokenizer. Compared with the original, the constructor public IKTokenizer(Reader in, boolean useSmart) has changed: it no longer takes a Reader parameter, because in Lucene 5.x the input is supplied later via setReader().
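To see the new lifecycle in action, the sketch below feeds a sentence through IKTokenizer5x directly and prints each term. It assumes the class above is compiled, with Lucene 5.x and the IK Analyzer jar on the classpath; the sample sentence is arbitrary.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
    public static void main(String[] args) throws Exception {
        IKTokenizer5x tokenizer = new IKTokenizer5x(true);
        // In Lucene 5.x the Reader is set after construction, not in the constructor.
        tokenizer.setReader(new StringReader("Lucene是一个全文检索引擎"));
        tokenizer.reset();                      // must be called before incrementToken()
        CharTermAttribute term = tokenizer.getAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {    // iterate over the segmented terms
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```

The reset() → incrementToken() → end() → close() sequence is the TokenStream contract; skipping reset() throws an IllegalStateException in Lucene 5.x.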

Adapting the Analyzer

Create a class IKAnalyzer5x that extends Analyzer:

import org.apache.lucene.analysis.Analyzer;

/**
 * ik-analyzer support for Lucene 5.x
 */
public class IKAnalyzer5x extends Analyzer {

    private boolean useSmart;

    public boolean useSmart() {
        return this.useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public IKAnalyzer5x() {
        this(false);
    }

    public IKAnalyzer5x(boolean useSmart) {
        this.useSmart = useSmart;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        IKTokenizer5x _IKTokenizer = new IKTokenizer5x(this.useSmart);
        return new TokenStreamComponents(_IKTokenizer);
    }

}

Here the method signature has changed from protected TokenStreamComponents createComponents(String fieldName, Reader in) to protected TokenStreamComponents createComponents(String fieldName), and the implementation uses the IKTokenizer5x created above. With these two classes defined, simply use IKAnalyzer5x in Lucene wherever you would have used IKAnalyzer.

Usage

Analyzer analyzer = new IKAnalyzer5x();
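As a fuller sketch, the analyzer can be plugged into a normal Lucene 5.x index-and-search round trip. This assumes Lucene 5.x core and queryparser jars plus the IK Analyzer jar on the classpath; the field name "content" and the sample text are arbitrary.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer5x(true);

        // Index one document into an in-memory directory.
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content", "Lucene是一个开源的全文检索引擎工具包", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search with the same analyzer so query terms are segmented identically.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("content", analyzer).parse("全文检索");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("hits: " + hits.totalHits);
        reader.close();
    }
}
```

Using the same analyzer at index and query time matters here: if the query were segmented by a different analyzer, the terms would not match those in the index.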
