GATE track 1 session

 
Latest revision as of 01:10, 10 September 2010

A full week of learning GATE (text mining, information extraction, and language processing), plus talks. Session wiki

GATE developer screenshot

GATE is written in Java and is very Java-centric. This makes it portable, fast, and heavyweight. A programming library is available. As of 2010 it is 14 years old and has many users and contributors.

Using GATE Developer

  • GATE Developer is used to process sets of Language Resources (documents grouped in a Corpus) using Processing Resources. The processed documents and corpora are typically saved to a serialized Datastore.
  • ANNIE, VG (verb group) processors.
  • "Preserve formatting" embeds annotation tags in the original HTML or XML.
    • GATE's graph-based (node/offset) XML and preserved formatting (original XML/HTML) have different strengths

Information Extraction

  • IR - retrieve docs
  • IE - retrieve structured data
  • Knowledge Engineering - rule based
  • Learning Systems - statistical

Old Bailey IE project - 17th-century English (Online)

  • POS - assigned in Token (noun, verb, etc)
  • Gazetteer - gotcha: the initialization parameter listsURL has to be set before it's loaded. Must also "save and reinitialize."
  • Gazetteer creates Lookups, then the transducer creates named entities
  • Then the orthomatcher (coreference based on shared spelling features) links those entities
  • Annotation Key sets and annotation comparing
    • Need the setsToKeep parameter (listing the Key set) in Document Reset for any pre-annotated texts
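
The gazetteer step above can be pictured with a toy sketch. This is not GATE's gazetteer or its API; the function name `gazetteer_lookup`, the dictionary layout, and the sample lists are all hypothetical, chosen only to show how list entries become Lookup-style annotations with offsets and a majorType that later rules can match on.

```python
import re

def gazetteer_lookup(text, lists):
    """Return Lookup-style annotations for every list entry found in text.

    `lists` maps a majorType (hypothetical sample data) to its entries.
    """
    lookups = []
    for major_type, entries in lists.items():
        for entry in entries:
            # Whole-word match of each list entry; offsets index into the text.
            for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
                lookups.append({
                    "start": m.start(),
                    "end": m.end(),
                    "majorType": major_type,
                })
    # Sort by start offset, like annotations laid out over the document.
    return sorted(lookups, key=lambda a: (a["start"], -a["end"]))

text = "John Smith stole a watch in London."
lists = {"person_first": ["John"], "location": ["London"]}
for lookup in gazetteer_lookup(text, lists):
    print(lookup, text[lookup["start"]:lookup["end"]])
```

In GATE itself the lists come from the files referenced by listsURL, and a later transducer turns the Lookups into named entities.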

Evaluation / Metrics

  • Evaluation metric - scored mathematically against human-annotated text
  • Scoring - performance measures for annotation types
  • Precision = correct / (correct + spurious)
  • Recall = correct / (correct + missing)
  • F-measure is the harmonic mean of precision and recall
  • F = 2⋅(precision⋅recall) / (precision + recall)
  • GATE supports strict, lenient, and average scoring
  • Result types - Correct, missing, spurious, partially correct (overlapped)
  • Tools > Annotations Diff - comparing human vs machine annotation
  • Corpus > Corpus quality assurance - compare by type
  • (B has to be the generated set)
  • Annotation set transfer (in tools) - transfer between docs in pipeline
    • useful for e.g. HTML that has boilerplate
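
The result types and measures above can be sketched in a few lines of Python. This is a minimal illustration, not GATE's Annotation Diff: the names `classify` and `score` are hypothetical, annotations are assumed to be plain (start, end, type) tuples, and the treatment of partially correct matches (ignored under strict, counted fully under lenient, counted at half weight under average) is an assumption about how the three modes differ. With no partial matches the formulas reduce to the ones listed above.

```python
# Weight given to partially correct matches in each scoring mode (assumed).
PARTIAL_WEIGHT = {"strict": 0.0, "lenient": 1.0, "average": 0.5}

def classify(key, response):
    """Compare human (key) and machine (response) annotations.

    Both are lists of (start, end, type) tuples.
    """
    key_left = list(key)
    resp_left = []
    counts = {"correct": 0, "partial": 0, "missing": 0, "spurious": 0}
    # Pass 1: identical span and type counts as correct.
    for r in response:
        if r in key_left:
            key_left.remove(r)
            counts["correct"] += 1
        else:
            resp_left.append(r)
    # Pass 2: same type with overlapping spans counts as partially correct.
    for r in resp_left:
        hit = next((k for k in key_left
                    if k[2] == r[2] and k[0] < r[1] and r[0] < k[1]), None)
        if hit is not None:
            key_left.remove(hit)
            counts["partial"] += 1
        else:
            counts["spurious"] += 1
    # Whatever remains in the key set was never found by the machine.
    counts["missing"] = len(key_left)
    return counts

def score(counts, mode="strict"):
    """Return (precision, recall, F-measure) for the given scoring mode."""
    matched = counts["correct"] + PARTIAL_WEIGHT[mode] * counts["partial"]
    response_total = counts["correct"] + counts["partial"] + counts["spurious"]
    key_total = counts["correct"] + counts["partial"] + counts["missing"]
    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

key = [(0, 5, "Person"), (10, 20, "Location"), (30, 35, "Date")]
response = [(0, 5, "Person"), (12, 20, "Location"), (40, 45, "Person")]
counts = classify(key, response)
for mode in ("strict", "lenient", "average"):
    print(mode, score(counts, mode))
```

Here the response set plays the role of set B (the generated set) in Corpus quality assurance.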


To investigate

  • markupAware for HTML/XML (keeps tags in editor)
  • AnnotationStack
  • Advanced Options

JAPE

  • Rules based on tokens and lookups
Phase: MatchingStyles
Input: Lookup
Options: control = appelt
Rule: Test1
(
({Lookup.majorType == location})?
{Lookup.majorType == loc_key}
):match
-->
:match.Location = {rule=Test1}

Copying features: :match.Location = { type = :match.Lookup.minorType}
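
One way to read the Test1 rule: because the phase declares Input: Lookup, only Lookup annotations are visible, and the pattern is an optional Lookup with majorType location followed by a Lookup with majorType loc_key. The Python below is a toy simulation of that reading, not how GATE executes JAPE. The function name `apply_test1` and the (start, end, majorType) tuples are hypothetical, overlapping Lookups are ignored, and the appelt style is approximated by preferring the longer two-annotation match.

```python
def apply_test1(lookups):
    """Simulate rule Test1 over Lookups given as (start, end, majorType).

    The list is assumed to be sorted by start offset and non-overlapping.
    Returns Location annotations as (start, end, features) tuples.
    """
    locations = []
    i = 0
    while i < len(lookups):
        start, end, major = lookups[i]
        nxt = lookups[i + 1] if i + 1 < len(lookups) else None
        if major == "location" and nxt is not None and nxt[2] == "loc_key":
            # Optional location present: the match spans both Lookups.
            locations.append((start, nxt[1], {"rule": "Test1"}))
            i += 2
        elif major == "loc_key":
            # Optional part absent: the loc_key Lookup matches on its own.
            locations.append((start, end, {"rule": "Test1"}))
            i += 1
        else:
            # A location not followed by loc_key does not fire the rule.
            i += 1
    return locations

lookups = [(0, 6, "location"), (7, 13, "loc_key"),
           (20, 26, "loc_key"), (30, 36, "location")]
print(apply_test1(lookups))
```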

To review, gotchas

  • Control styles: first takes only the first match, excluding compound matches
    • a? b for "a b" will match "a b"
  • multiplexor transducers
  • multi-constraint statements
  • macros
  • Reusing created annotations has to be done in a separate rule

Matching types

Matching styles for JAPE

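To follow up

  • WebSphinx crawler CREOLE plugin
  • GATE NLP web services (http://www.semanticsoftware.info/semantic-assistants-architecture)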

Other notes

Lucene data store and ANNIC

  • Use <null> for default set
  • Go to Datastore for queries
    • eg {Person}({Token})+{Money}
  • Useful for debugging JAPE and results

GATE-lucene-person-money.png

Demos

  • Mímir for querying large volumes of data (uses MG4J)
  • Translating parts of speech between languages using Compound editor and Alignment editor
  • Predicate extractor (MultiPaX)
    • Mixed results at best
  • OwlExporter
    • NLP ontology

Conclusions

While GATE can do a lot out of the box and benefits from years of development and broad connectivity, it needs usability testing to be useful to anyone other than patient specialists. Many things are not obvious or are too domain specific, yet with a bit of work they could be more broadly useful. The interface could offer many more immediate, useful, and interesting-looking displays, and a web-based version could have these features. However, the team seems somewhat ambivalent about development. :)

Looking forward to learning about programming using GATE libraries.


GATE track 1 session


Location

Toronto

