mecabwrap is yet another Python interface to MeCab Morphological Analyzer.
It is designed to work seamlessly on both Unix and Windows machines.
Ubuntu
$ sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
Mac OSX
$ brew install mecab mecab-ipadic
Windows
Download and run the installer.
See also: official website
The package is now on PyPI, so it can be installed with the pip command:
$ pip install mecabwrap
Or, the latest development version can be installed from GitHub:
$ git clone --depth 1 https://github.com/kota7/mecabwrap-py.git
$ cd mecabwrap-py
$ pip install -U .
The following command prints the MeCab version. If it fails, MeCab is either not installed or not on the search path.
$ mecab -v
# should result in `mecab of 0.996` or similar.
To verify that the package is successfully installed, try the following:
$ python
>>> from mecabwrap import tokenize
>>> for token in tokenize(u"すもももももももものうち"):
...     print(token)
...
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
from mecabwrap import tokenize

for token in tokenize('すもももももももものうち'):
    print(token)
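Each token printed above is an object whose string form is the raw MeCab line. As a minimal sketch (assuming only that str(token) reproduces the tab-separated "surface TAB features" line shown in the output above), the surfaces and coarse parts of speech can be collected like this:

from mecabwrap import tokenize

# collect (surface, part-of-speech) pairs by splitting the raw line
# format "surface<TAB>feature,feature,..." (assumption: str(token)
# reproduces the MeCab line shown in the output above)
pairs = []
for token in tokenize('すもももももももものうち'):
    surface, features = str(token).split('\t')
    pairs.append((surface, features.split(',')[0]))
print(pairs)  # [('すもも', '名詞'), ('も', '助詞'), ...]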
To configure the MeCab calls, one may use the do_ functions, which accept an arbitrary number of MeCab options. Currently, the following three do_ functions are provided:

do_mecab : works with a single input text.
do_mecab_vec : works with multiple input texts.
do_mecab_iter : works with multiple input texts and returns a generator.

For example, the following code invokes the wakati option, so the outcome is words separated by spaces with no meta information. See the official site for more details.
from mecabwrap import do_mecab
out = do_mecab('人生楽ありゃ苦もあるさ', '-Owakati')
print(out)
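Since the wakati outcome is just a space-separated string, a word list can be obtained with plain string operations; a small sketch (no mecabwrap API beyond do_mecab is assumed):

from mecabwrap import do_mecab

# the wakati outcome is space-separated surfaces, so str.split suffices
out = do_mecab('人生楽ありゃ苦もあるさ', '-Owakati')
words = out.split()
print(words)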
The example below uses do_mecab_vec to parse multiple texts. Note that the -F option configures the output format for each token, and the -E option sets the end-of-sentence string.
from mecabwrap import do_mecab_vec
ins = ['春はあけぼの', 'やうやう白くなりゆく山際', '少し明かりて', '紫だちたる雲の細くたなびきたる']
out = do_mecab_vec(ins, '-F%f[6](%f[1]) | ', '-E...ここまで\n')
print(out)
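In the -F format, %f[i] refers to the i-th feature field of the dictionary entry; with the standard ipadic layout, %f[6] is the base form, %f[1] the part-of-speech subdivision, and %f[7] the katakana reading. As a sketch under that ipadic assumption, the following prints one reading per line:

from mecabwrap import do_mecab

# print the katakana reading (ipadic feature field 7) of each token,
# one per line (assumption: the dictionary follows the ipadic layout)
out = do_mecab('春はあけぼの', '-F%f[7]\n')
print(out)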
When the number of input texts is large, holding all the outcomes in memory may not be a good idea. The do_mecab_iter function, which works with multiple texts, returns a generator of MeCab results.
When byline=True, chunks are separated by line breaks, so a chunk corresponds to a token in the default setting.
When byline=False, chunks are separated by EOS, so a chunk corresponds to a sentence.
from mecabwrap import do_mecab_iter
ins = ['春はあけぼの', 'やうやう白くなりゆく山際', '少し明かりて', '紫だちたる雲の細くたなびきたる']
print('\n*** generating tokens ***')
i = 0
for text in do_mecab_iter(ins, byline=True):
    i += 1
    print('(' + str(i) + ')\t' + text)
print('\n*** generating tokenized sentences ***')
i = 0
for text in do_mecab_iter(ins, '-E', '(文の終わり)', byline=False):
    i += 1
    print('---(' + str(i) + ')\n' + text)
To write the MeCab outcomes directly to a file, one may use either the -o option or the outpath argument. Note that this does not work with do_mecab_iter, since it is designed to write the outcomes to a temporary file.
do_mecab('すもももももももものうち', '-osumomo1.txt')
# or,
do_mecab('すもももももももものうち', outpath='sumomo2.txt')
with open('sumomo1.txt') as f:
    print(f.read())
with open('sumomo2.txt') as f:
    print(f.read())
import os
# clean up
os.remove('sumomo1.txt')
os.remove('sumomo2.txt')
# this does not create a file
do_mecab_iter(['すもももももももものうち'], '-osumomo3.txt')
os.path.exists('sumomo3.txt')
When an input text is longer than the input buffer size (default: 8192 bytes), MeCab automatically splits it into two "sentences" by inserting an extra EOS (and a few letters are lost around the split point).
As a result, do_mecab_vec and do_mecab_iter may produce more output elements than there are inputs.
The functions provide two workarounds for this (v0.2.3 or later):

When auto_buffer_size is True, the input-buffer-size option is automatically adjusted to a level large enough to cover all the input texts. Note that this won't work when the input size exceeds MeCab's maximum buffer size, 8192 * 640 bytes, or about 5MB.
When truncate is True, the input texts are truncated so that they fit within the input buffer size.

Note that do_mecab does not have these features.
import warnings
x = 'すもももももももものうち!' * 225
print("input buffer size =", len(x.encode()))
with warnings.catch_warnings(record=True) as w:
    res1 = list(do_mecab_iter([x]))
# the text is split into two since it exceeds the input buffer size
print("output length =", len(res1))
print('***\nEnd of the first element')
print(res1[0][-150:])
print('***\nBeginning of the second element')
print(res1[1][0:150])
import re
res2 = list(do_mecab_iter([x], auto_buffer_size=True))
print("output length =", len(res2))
print('***\nEnd of the first element')
print(res2[0][-150:])
# count the number of '!', to confirm all 225 repetitions are covered
print('number of "!" =', len(re.findall(r'!', ''.join(res2))))
print()
res3 = list(do_mecab_iter([x], truncate=True))
print("output length =", len(res3))
print('***\nEnd of the first element')
print(res3[0][-150:])
# count the number of '!', to confirm some are lost due to truncation
print('number of "!" =', len(re.findall(r'!', ''.join(res3))))
All text inputs are assumed to be unicode.
In Python 2, inputs must be u'' strings, not ''.
In Python 3, the str type is unicode, so u'' and '' are equivalent.
o1 = do_mecab('すもももももももものうち')  # this works only in Python 3
o2 = do_mecab(u'すもももももももものうち')  # this works in both Python 2 and 3
print(o1)
print(o2)
The functions take the mecab_enc option, which indicates the encoding of the MeCab dictionary being used. Usually this can be left as the default value None, so that the encoding is detected automatically. Alternatively, one may specify the encoding explicitly.
# show the charset of the mecab dictionary (the ! is an IPython shell escape)
! mecab -D | grep charset
print()
o1 = do_mecab('日本列島改造論', mecab_enc=None) # default
print(o1)
o2 = do_mecab('日本列島改造論', mecab_enc='utf-8') # explicitly specified
print(o2)
#o3 = do_mecab('日本列島改造論', mecab_enc='cp932') # wrong encoding, fails