mecabwrap

A Python Interface to MeCab for Unix and Windows

mecabwrap is yet another Python interface to the MeCab morphological analyzer.

It is designed to work seamlessly on both Unix and Windows machines.

Requirement

  • Python 2.6+ or 3.4+
  • MeCab 0.996

Installation

1. Install MeCab

Ubuntu

$ sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8

Mac OSX

$ brew install mecab mecab-ipadic

Windows

Download and run the installer.

See also: official website

2. Install this Package

The package is on PyPI and can be installed with pip:

$ pip install mecabwrap

Alternatively, the latest development version can be installed from GitHub:

$ git clone --depth 1 https://github.com/kota7/mecabwrap-py.git
$ cd mecabwrap-py
$ pip install -U .

Quick Check

The following command should print the MeCab version. If it fails, MeCab is either not installed or not on the search path.

$ mecab -v
# should result in `mecab of 0.996` or similar.

To verify that the package is successfully installed, try the following:

$ python
>>> from mecabwrap import tokenize
>>> for token in tokenize(u"すもももももももものうち"): 
...     print(token)
... 
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も   助詞,係助詞,*,*,*,*,も,モ,モ
もも  名詞,一般,*,*,*,*,もも,モモ,モモ
も   助詞,係助詞,*,*,*,*,も,モ,モ
もも  名詞,一般,*,*,*,*,もも,モモ,モモ
の   助詞,連体化,*,*,*,*,の,ノ,ノ
うち  名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Usage

A Simple Tokenizer

The tokenize function is a high-level API for splitting a text into tokens. It returns a generator of tokens.

In [1]:
from mecabwrap import tokenize

for token in tokenize('すもももももももものうち'):
    print(token)
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
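Each line of the default output pairs the surface form with a comma-separated feature string, following the IPA dictionary layout (part of speech first, reading and pronunciation last). As a rough sketch independent of mecabwrap itself, one such line can be parsed with plain string operations:

```python
# One raw MeCab output line, copied from the output above:
# "surface<TAB>feature1,feature2,...".
line = 'すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ'

surface, feature_str = line.split('\t')
features = feature_str.split(',')

print(surface)       # すもも
print(features[0])   # 名詞 (part of speech)
print(features[-1])  # スモモ (pronunciation)
```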

Using MeCab Options

To configure MeCab calls, one may use the do_ functions, which accept an arbitrary number of MeCab options.
Currently, the following three do_ functions are provided.

  • do_mecab: works with a single input text.
  • do_mecab_vec: works with multiple input texts.
  • do_mecab_iter: works with multiple input texts and returns a generator.

For example, the following code invokes the wakati option, so the outcome is words separated by spaces, with no meta information. See the official site for more details.

In [2]:
from mecabwrap import do_mecab
out = do_mecab('人生楽ありゃ苦もあるさ', '-Owakati')
print(out)
人生 楽 ありゃ 苦 も ある さ 

The example below uses do_mecab_vec to parse multiple texts. Note that the -F option configures the output format.

In [3]:
from mecabwrap import do_mecab_vec
ins = ['春はあけぼの', 'やうやう白くなりゆく山際', '少し明かりて', '紫だちたる雲の細くたなびきたる']

out = do_mecab_vec(ins, '-F%f[6](%f[1]) | ', '-E...ここまで\n')
print(out)
春(一般) | は(係助詞) | あけぼの(固有名詞) | ...ここまで
やうやう(一般) | 白い(自立) | なる(自立) | ゆく(非自立) | 山際(一般) | ...ここまで
少し(助詞類接続) | 明かり(一般) | て(格助詞) | ...ここまで
紫(一般) | だ() | ちる(自立) | たり() | 雲(一般) | の(連体化) | 細い(自立) | たなびく(自立) | たり() | ...ここまで

Returning Iterators

When the number of input texts is large, holding all the outcomes in memory may not be a good idea. The do_mecab_iter function, which works with multiple texts, returns a generator of MeCab results. When byline=True, chunks are separated by line breaks, so a chunk corresponds to a token in the default setting. When byline=False, chunks are separated by EOS, so a chunk corresponds to a sentence.
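With byline=False, each yielded chunk is a block of token lines terminated by EOS. As a minimal sketch (using a hand-written chunk rather than an actual MeCab call), such a chunk can be post-processed into a list of surface forms:

```python
# A sentence chunk in the shape yielded by do_mecab_iter(..., byline=False):
# token lines followed by an "EOS" marker.
chunk = (
    '春\t名詞,一般,*,*,*,*,春,ハル,ハル\n'
    'は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n'
    'EOS'
)

# Keep the surface form (before the tab) of every non-EOS line.
tokens = [ln.split('\t')[0] for ln in chunk.splitlines() if ln != 'EOS']
print(tokens)  # ['春', 'は']
```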

In [4]:
from mecabwrap import do_mecab_iter

ins = ['春はあけぼの', 'やうやう白くなりゆく山際', '少し明かりて', '紫だちたる雲の細くたなびきたる']

print('\n*** generating tokens ***')
i = 0
for text in do_mecab_iter(ins, byline=True):
    i += 1
    print('(' + str(i) + ')\t' + text)
    
print('\n*** generating tokenized sentences ***')
i = 0
for text in do_mecab_iter(ins, '-E', '(文の終わり)', byline=False):
    i += 1
    print('---(' + str(i) + ')\n' + text)
*** generating tokens ***
(1)	春	名詞,一般,*,*,*,*,春,ハル,ハル
(2)	は	助詞,係助詞,*,*,*,*,は,ハ,ワ
(3)	あけぼの	名詞,固有名詞,地域,一般,*,*,あけぼの,アケボノ,アケボノ
(4)	EOS
(5)	やうやう	副詞,一般,*,*,*,*,やうやう,ヤウヤウ,ヨーヨー
(6)	白く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,白い,シロク,シロク
(7)	なり	動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
(8)	ゆく	動詞,非自立,*,*,五段・カ行促音便ユク,基本形,ゆく,ユク,ユク
(9)	山際	名詞,一般,*,*,*,*,山際,ヤマギワ,ヤマギワ
(10)	EOS
(11)	少し	副詞,助詞類接続,*,*,*,*,少し,スコシ,スコシ
(12)	明かり	名詞,一般,*,*,*,*,明かり,アカリ,アカリ
(13)	て	助詞,格助詞,連語,*,*,*,て,テ,テ
(14)	EOS
(15)	紫	名詞,一般,*,*,*,*,紫,ムラサキ,ムラサキ
(16)	だ	助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
(17)	ち	動詞,自立,*,*,五段・ラ行,体言接続特殊2,ちる,チ,チ
(18)	たる	助動詞,*,*,*,文語・ナリ,体言接続,たり,タル,タル
(19)	雲	名詞,一般,*,*,*,*,雲,クモ,クモ
(20)	の	助詞,連体化,*,*,*,*,の,ノ,ノ
(21)	細く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,細い,ホソク,ホソク
(22)	たなびき	動詞,自立,*,*,五段・カ行イ音便,連用形,たなびく,タナビキ,タナビキ
(23)	たる	助動詞,*,*,*,文語・ナリ,体言接続,たり,タル,タル
(24)	EOS

*** generating tokenized sentences ***
---(1)
春	名詞,一般,*,*,*,*,春,ハル,ハル
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
あけぼの	名詞,固有名詞,地域,一般,*,*,あけぼの,アケボノ,アケボノ
(文の終わり)
---(2)
やうやう	副詞,一般,*,*,*,*,やうやう,ヤウヤウ,ヨーヨー
白く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,白い,シロク,シロク
なり	動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
ゆく	動詞,非自立,*,*,五段・カ行促音便ユク,基本形,ゆく,ユク,ユク
山際	名詞,一般,*,*,*,*,山際,ヤマギワ,ヤマギワ
(文の終わり)
---(3)
少し	副詞,助詞類接続,*,*,*,*,少し,スコシ,スコシ
明かり	名詞,一般,*,*,*,*,明かり,アカリ,アカリ
て	助詞,格助詞,連語,*,*,*,て,テ,テ
(文の終わり)
---(4)
紫	名詞,一般,*,*,*,*,紫,ムラサキ,ムラサキ
だ	助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
ち	動詞,自立,*,*,五段・ラ行,体言接続特殊2,ちる,チ,チ
たる	助動詞,*,*,*,文語・ナリ,体言接続,たり,タル,タル
雲	名詞,一般,*,*,*,*,雲,クモ,クモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
細く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,細い,ホソク,ホソク
たなびき	動詞,自立,*,*,五段・カ行イ音便,連用形,たなびく,タナビキ,タナビキ
たる	助動詞,*,*,*,文語・ナリ,体言接続,たり,タル,タル
(文の終わり)

Writing the outcome to a file

To write the MeCab outcomes directly to a file, one may use either the -o option or the outpath argument. Note that this does not work with do_mecab_iter, since it is designed to write the outcomes to a temporary file.

In [5]:
do_mecab('すもももももももものうち', '-osumomo1.txt')
# or,
do_mecab('すもももももももものうち', outpath='sumomo2.txt')

with open('sumomo1.txt') as f: 
    print(f.read())
with open('sumomo2.txt') as f: 
    print(f.read())

import os
# clean up
os.remove('sumomo1.txt')
os.remove('sumomo2.txt')

# this does not create a file
do_mecab_iter(['すもももももももものうち'], '-osumomo3.txt')
os.path.exists('sumomo3.txt')
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Out[5]:
False

Very Long Input and Buffer Size

When an input text is longer than the input buffer size (default: 8192), MeCab automatically splits it into two "sentences" by inserting an extra EOS (and a few letters are lost around the split point). As a result, do_mecab_vec and do_mecab_iter may produce more output elements than there are inputs.

The functions provide two workarounds for this (v0.2.3 or later):

  1. If the option auto_buffer_size is True, the input-buffer-size option is automatically adjusted so that it is large enough to cover all the input text. Note that this does not work when the input size exceeds MeCab's maximum buffer size, 8192 * 640 ~ 5MB.
  2. If the option truncate is True, the input texts are truncated so that they fit within the input buffer size.

Note that do_mecab does not have these features.
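Since MeCab measures its buffer in bytes rather than characters, whether a text fits can be checked up front with a byte-length comparison. Below is a minimal sketch; DEFAULT_BUFFER_SIZE and fits_default_buffer are hypothetical names for illustration, not part of mecabwrap's API, and the exact threshold at which MeCab splits may differ slightly:

```python
# Default input buffer size mentioned above (bytes, not characters).
DEFAULT_BUFFER_SIZE = 8192

def fits_default_buffer(text):
    # MeCab counts the UTF-8 encoded byte length of the input.
    return len(text.encode('utf-8')) < DEFAULT_BUFFER_SIZE

# The same input as the example below: 12 three-byte characters plus '!',
# i.e. 37 bytes per repetition, 225 repetitions.
x = 'すもももももももものうち!' * 225
print(len(x.encode('utf-8')))  # 8325 -> exceeds the default buffer
print(fits_default_buffer(x))  # False
```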

In [6]:
import warnings

x = 'すもももももももものうち!' * 225
print("input buffer size =", len(x.encode()))

with warnings.catch_warnings(record=True) as w:
    res1 = list(do_mecab_iter([x]))
# the text is split into two since it exceeds the input buffer size
print("output length =", len(res1))

print('***\nEnd of the first element')
print(res1[0][-150:])

print('***\nBeginning of the second element')
print(res1[1][0:150])
output would contain extra EOS
input buffer size = 8325
output length = 2
***
End of the first element
モ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
記号,一般,*,*,*,*,*
EOS
***
Beginning of the second element
記号,一般,*,*,*,*,*
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
すもも	名詞,一般
In [7]:
import re

res2 = list(do_mecab_iter([x], auto_buffer_size=True))
print("output length =", len(res2))

print('***\nEnd of the first element')
print(res2[0][-150:])

# count the number of '!', to confirm all 225 repetitions are covered
print('number of "!" =', len(re.findall(r'!', ''.join(res2))))

print()
res3 = list(do_mecab_iter([x], truncate=True))
print("output length =", len(res3))

print('***\nEnd of the first element')
print(res3[0][-150:])

# count the number of '!', to confirm some are lost due to truncation
print('number of "!" =', len(re.findall(r'!', ''.join(res3))))
output length = 1
***
End of the first element
も	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
EOS
number of "!" = 225

output length = 1
***
End of the first element
モ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
記号,一般,*,*,*,*,*
EOS
number of "!" = 221

Note on Python 2

All text inputs are assumed to be unicode.
In Python 2, inputs must be u'' strings, not '' strings. In Python 3, the str type is unicode, so u'' and '' are equivalent.
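When the input may arrive as raw bytes (e.g. read from a file in binary mode), it can be normalized to unicode before calling the package. A minimal sketch; ensure_unicode is a hypothetical helper, not part of mecabwrap's API:

```python
def ensure_unicode(text, encoding='utf-8'):
    # Decode bytes to unicode; pass unicode strings through unchanged.
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text

# b'\xe3\x81\x99\xe3\x82\x82\xe3\x82\x82' is the UTF-8 encoding of すもも.
print(ensure_unicode(b'\xe3\x81\x99\xe3\x82\x82\xe3\x82\x82'))  # すもも
print(ensure_unicode(u'すもも'))                                # すもも
```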

In [8]:
o1 = do_mecab('すもももももももものうち')   # this works only for python 3
o2 = do_mecab(u'すもももももももものうち')  # this works both for python 2 and 3
print(o1)
print(o2)
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Note on dictionary encodings

The functions take a mecab_enc option, which indicates the encoding of the MeCab dictionary in use. Usually this can be left at its default value None, in which case the encoding is detected automatically. Alternatively, one may specify the encoding explicitly.
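The reason a mismatched mecab_enc fails is that the bytes exchanged with MeCab must match the dictionary's charset, and the same text has different byte representations under different encodings. A small sketch with no MeCab call involved:

```python
text = '日本'

# The same string under two common dictionary charsets:
print(text.encode('utf-8'))  # b'\xe6\x97\xa5\xe6\x9c\xac'
print(text.encode('cp932'))  # b'\x93\xfa\x96{'

# The byte sequences differ, so bytes encoded for a cp932 dictionary
# are not valid input for a UTF-8 dictionary, and vice versa.
print(text.encode('utf-8') == text.encode('cp932'))  # False
```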

In [9]:
# show mecab dict
! mecab -D | grep charset
print()

o1 = do_mecab('日本列島改造論', mecab_enc=None)      # default
print(o1)

o2 = do_mecab('日本列島改造論', mecab_enc='utf-8')   # explicitly specified
print(o2)

#o3 = do_mecab('日本列島改造論', mecab_enc='cp932')   # wrong encoding, fails
charset:	UTF-8

日本	名詞,固有名詞,地域,国,*,*,日本,ニッポン,ニッポン
列島	名詞,一般,*,*,*,*,列島,レットウ,レットー
改造	名詞,サ変接続,*,*,*,*,改造,カイゾウ,カイゾー
論	名詞,接尾,一般,*,*,*,論,ロン,ロン
EOS

日本	名詞,固有名詞,地域,国,*,*,日本,ニッポン,ニッポン
列島	名詞,一般,*,*,*,*,列島,レットウ,レットー
改造	名詞,サ変接続,*,*,*,*,改造,カイゾウ,カイゾー
論	名詞,接尾,一般,*,*,*,論,ロン,ロン
EOS