Open Resources【公開資源】

GitHub Repositories

Tohoku NLP Group GitHub Repositories

Visit our GitHub repositories for our latest resources.

See more at Tohoku NLP Group GitHub Repositories

Open Tools / 公開ツール

VecDCS: A Tool for Learning Distributional Representation for Dependency-based Compositional Semantics / 依存構造で意味的に構成可能な分散表現 VecDCS

This tool learns semantically compositional word vectors and syntactic transformation matrices from a dependency-parsed corpus. Please read the following paper for the details.


glimvec: A Compositional Knowledge Graph Completion Model / 知識ベース埋め込みによる知識ベース補完モデル glimvec

der-network: A Reading Comprehension Model that Leverages Dynamic Entity Representation / dynamic entity representationによるMachine Readingモデル der-network

jawikif: A Named Entity Linker for Japanese Wikification / 日本語wikificationシステム jawikif

PoEC: An English Academic Writing Assistant Search Engine / 英作文支援のための用例検索システム PoEC

The system retrieves instances of a given expression, together with their contexts, from a large academic corpus. Since the corpus consists of international conference papers on Natural Language Processing (NLP), the tool is particularly useful when writing papers for international NLP conferences. Please read PDF for the details.


Phillip: A High Speed Reasoning Engine / 仮説推論エンジン Phillip

Phillip is a high-speed abductive reasoning engine that makes logical inferences based on integer linear programming. It is a C++ implementation of the method described in [1].

[1] Kazeto Yamamoto, Naoya Inoue, Kentaro Inui, Yuki Arase and Jun’ichi Tsujii. Boosting the Efficiency of First-order Abductive Reasoning Using Pre-estimated Relatedness between Predicates. International Journal of Machine Learning and Computing, Vol. 5, No. 2, pp. 114-120, April 2015. (DOI: 10.7763/IJMLC.2015.V5.493)

normalizeNumexp: A Tool for Numerical and Temporal Expression Normalization / 数量表現・時間表現の規格化ツール normalizeNumexp

This tool identifies Japanese numerical and temporal expressions and converts them into a machine-readable representation.
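To make the idea of normalization concrete, the following is a minimal sketch that converts kanji numerals into integers. It is an illustration only and is not taken from normalizeNumexp: the function name `kanji_to_int` and the lookup tables are hypothetical, and the sketch covers plain numerals only (no temporal expressions, ranges, or counters).

```python
# Toy normalizer for Japanese kanji numerals.
# Illustration only: NOT normalizeNumexp's code or API. All names are hypothetical.

DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}


def kanji_to_int(text: str) -> int:
    """Convert a kanji numeral such as 三千五百 into an integer."""
    total = 0    # sum of completed 万/億 groups
    section = 0  # value accumulated inside the current group (below 万)
    digit = 0    # digits seen since the last unit character
    for ch in text:
        if ch in DIGITS:
            # Consecutive digits are read positionally (二〇二三 is 2023).
            digit = digit * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            # Close the current group and scale it by 万 or 億.
            total += ((section + digit) or 1) * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


if __name__ == "__main__":
    for numeral in ["十二", "三千五百", "一億二千万", "二〇二三"]:
        print(numeral, kanji_to_int(numeral))
```

A real normalizer also has to locate such expressions in running text and handle units, dates, and ambiguity, which this sketch does not attempt.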


ChaPAS: A Japanese Predicate-argument Structure Analyzer / 日本語述語項構造解析器 ChaPAS

This tool identifies intra-sentence predicate-argument structures in Japanese text.


Zunda: A Japanese Extended Modality Analyzer / 日本語拡張モダリティ解析器 Zunda

This tool is used to analyze an intra-sentence event (expressed as a verb or an adjective) in terms of the degree of reliability, certainty and subjectivity.


Open Datasets / 公開リソース

Pretrained Japanese BERT Models / 日本語BERT訓練済みモデル

All models are trained on Japanese Wikipedia. Two types of models are available: one that uses MeCab (ipadic) + WordPiece tokenization and one that uses Japanese character-level tokenization.
The pretrained models are included in Transformers, the natural language processing library by Hugging Face, and can be used in the same way as the other models in Transformers.
Please see this for the details.
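The difference between the two tokenization schemes can be sketched with a toy example. The code below is a simplified illustration, not the released tokenizer: the vocabulary `TOY_VOCAB` is made up, the input word is assumed to have already been segmented (the step that MeCab performs in the actual models), and the subword step is the standard greedy longest-match-first WordPiece procedure.

```python
# Toy comparison of WordPiece and character-level tokenization.
# Illustration only: the vocabulary is hypothetical, and the word is assumed
# to be pre-segmented (in the released models MeCab does this step).

TOY_VOCAB = {"自然", "##言語", "##処理", "大学"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of one pre-segmented word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a word-internal piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def char_tokenize(word):
    """Character-level tokenization: every character is its own token."""
    return list(word)


if __name__ == "__main__":
    word = "自然言語処理"
    print(wordpiece(word, TOY_VOCAB))
    print(char_tokenize(word))
```

The WordPiece variant yields fewer, longer tokens but depends on the quality of the word segmentation and vocabulary; the character variant never produces an unknown token for a character in its vocabulary, at the cost of longer sequences.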


An Answerability Annotated Reading Comprehension Dataset / 解答可能性付き読解データセット

The reading comprehension dataset consists of 56,651 question-answer-article triplets, each manually annotated with an answerability score that indicates how confidently the question can be answered from the article in the triplet.
Specifically, given about 12,000 buzzer quiz question-answer (QA) pairs, we automatically collected up to 5 related Wikipedia articles (or paragraphs) for each QA pair. We then used a crowdsourcing platform to annotate each question-answer-article triplet with an answerability score.
Please see this for the details.


Automatically Generated Lyrics Labelled with Topic Transitions / トピック遷移構造付き生成歌詞データ

The dataset consists of 100 artificial lyrics, each of which is automatically generated in response to a given sequence of word numbers (indicating a sequence of mora numbers) and a topic transition (representing the development of the storyline).
Please see this for the details.



A Japanese Wikification Corpus / 日本語Wikificationコーパス

The corpus consists of 340 newspaper articles (the PN subcorpus of BCCWJ) taken from the Extended Named Entity (ENE) annotated corpus, in which the ENEs are aligned with their corresponding Wikipedia entries.
It is intended for developing and evaluating Japanese entity linking and wikification systems.
Please see this for the details.


Japanese Wikipedia Entity Vectors / 日本語 Wikipedia エンティティベクトル

Wikipedia Entity Vectors are distributed vector representations of words and of named entities that have their own Wikipedia articles. They are learned from the full body text of preprocessed Japanese Wikipedia articles using the skip-gram algorithm.
Please see this for the details.
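As a rough illustration of how such vectors are typically used, the sketch below parses vectors from the common word2vec text layout (a header line with the vocabulary size and dimension, then one token per line) and ranks tokens by cosine similarity. The layout is an assumption about the distribution format, and the three toy vectors and all function names are made up for the example.

```python
# Toy nearest-neighbour lookup over word/entity vectors.
# Assumptions: the word2vec text layout is assumed, and the vectors in
# TOY_VECTORS are invented for illustration (they are not from the dataset).
import math

TOY_VECTORS = """3 3
東北大学 1.0 0.0 0.0
仙台 0.8 0.6 0.0
りんご 0.0 0.0 1.0
"""


def parse_word2vec_text(text):
    """Parse 'count dim' header plus 'token v1 v2 ...' lines into a dict."""
    lines = text.strip().splitlines()
    count, dim = map(int, lines[0].split())
    vectors = {}
    for line in lines[1:]:
        token, *values = line.split()
        vec = [float(v) for v in values]
        if len(vec) != dim:
            raise ValueError(f"expected {dim} values for {token!r}")
        vectors[token] = vec
    if len(vectors) != count:
        raise ValueError("header count does not match number of vectors")
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=1):
    """Return the topn (token, similarity) pairs closest to the query token."""
    scores = [(tok, cosine(vectors[query], vec))
              for tok, vec in vectors.items() if tok != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


if __name__ == "__main__":
    vectors = parse_word2vec_text(TOY_VECTORS)
    print(most_similar(vectors, "東北大学"))
```

For the full-size vectors, a dedicated library with an optimized loader and vectorized similarity search would be a more practical choice than this pure-Python loop.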


A Wikipedia Corpus Annotated with Excitatory and Inhibitory Relations / Wikipedia記事への促進・抑制関係付与コーパス

The corpus consists of 1,494 Wikipedia abstracts manually annotated with excitatory and inhibitory relations using a crowdsourcing platform (10 annotators per abstract).
Please see this for the details.



A Japanese Twitter Dataset for Evaluation Expression and Target Extraction Task / 評価対象-評価表現抽出用日本語Twitterデータセット

The dataset was created from Japanese Twitter posts and is annotated with evaluation expressions and their targets, together with the corresponding lexical spans, for use in the evaluation expression and target extraction task.
Please see this for the details.



A Japanese Functional Expressions Annotated Corpus / 機能表現タグ付与コーパス

This corpus was created from the Yahoo! Answers portion of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and is annotated with the semantic levels of Functional Expressions (FEs). The semantic levels of FEs are an extension based on this. The annotation scheme is also available here.

Since the sentences in the corpus have also been annotated with Modality Tags by Matsuyoshi et al., the corpus is useful for research on modality analysis as well.
To use the corpus, you must separately sign the contract for using the offline version of the BCCWJ corpus. Please see this about BCCWJ.

Please see this for the details.




Inter-clause Restrictive Relation Corpus 1.0 / 文節間限定関係コーパス1.0

To identify weakly contradictory relationships between sentences, it is necessary to identify the restrictive relationships (expressing condition, degree, and scope) between clauses; this task is called Inter-clause Restrictive Relation Detection. The corpus was built for this task and is manually annotated with inter-clause restrictive relations. The annotation scheme is detailed below.


To access the corpus, please contact us by email: inuilabresources at

Statement Map Corpus / 言論マップコーパス

This corpus was created for Statement Map, a project designed to help users navigate the vast amount of information on the internet. It consists of 20 types of user queries and the documents retrieved for those queries, annotated with semantic relationships.
Please see this for the details.



Japanese ESP Dictionary / 事象選択述語辞書

The dictionary lists predicates that can affect the modality of an event (e.g., the predicate “stop” affects the event “run” in “stop running”) and describes the specific effect of each predicate on the event's modality.
Please see this for the details.


Japanese Sentiment Polarity Dictionary / 日本語評価極性辞書

Japanese Sentiment Polarity Dictionary (Verb) / 日本語評価極性辞書(用言編)

The dictionary contains about 5,000 evaluation expressions extracted from the Dictionary of Evaluation Expression edited by Kobayashi, each manually annotated with sentiment polarity information.


Japanese Sentiment Polarity Dictionary (Noun) / 日本語評価極性辞書(名詞編)

The dictionary contains about 8,500 nouns and compound nouns that convey sentiment polarity. Each entry is annotated with sentiment polarity information that has been manually checked for quality.


Please see this for the details.


A Geographical Entities Annotated Corpus / 場所参照表現タグ付きコーパス

The corpus consists of Japanese microblog texts randomly sampled from Twitter, in which each expression that refers to an actual geographic location is manually annotated with its corresponding geographical entity. The corpus was created for developing and evaluating entity linking systems and geographic parsing systems (such as GeoNLP).
Please see this for the details.



Last-modified: 2023-02-14 (Tue) 21:44:17 (34d)
