
httpjavasuncom

By Andrea Fox,2014-05-21 12:20

    Introduction to Minorthird:

    Getting Started Tutorial

    Part I: Installing Minorthird

Before you can download Minorthird, Java and Ant must be installed. Java can be found at http://java.sun.com. Make sure that you download and install the SDK. Ant can be downloaded at http://ant.apache.org.

Once you have installed both Java and Ant, do the following (directions are below under “Setting Environment Variables”):

    1. Define JAVA_HOME

    2. Define ANT_HOME

    3. Add $JAVA_HOME/bin and $ANT_HOME/bin to your PATH

Setting Environment Variables (including your PATH)

    Windows:

    1. Follow: Start -> Settings -> Control Panel

    2. If you are using Windows XP, click on Switch to Classic View on the left

    3. Double click on System

    4. Click on the Advanced tab

    5. Click the Environment Variables button near the bottom

    6. The top box should be labeled user variables and there may or may not be a

    variable already labeled PATH, JAVA_HOME or ANT_HOME. If the variable

    you are looking for is already there, highlight it and click Edit; otherwise click

    New, and add the variable name and location.

    7. Add the complete paths of your Java bin and Ant bin directories to your PATH, separated by semicolons. Your addition should look something like this:

    %JAVA_HOME%\bin;%ANT_HOME%\bin

Linux (Cygwin):

    Type these commands:

    1. % export JAVA_HOME=c:/j2sdk1.4.2_04 (or wherever you put it)

    2. % export ANT_HOME=c:/apache-ant-1.6.1 (or wherever you put it)

    3. % export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$PATH
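Before moving on, it can help to confirm that the three settings above took effect. The following is a minimal Python sketch and is not part of Minorthird; the function name check_build_env is made up for illustration, and it only checks that each bin directory appears as an entry in PATH.

```python
import os


def check_build_env(env):
    """Return a list of problems with JAVA_HOME, ANT_HOME and PATH in `env`."""
    problems = []
    # Normalise PATH entries so trailing slashes (and case on Windows) don't matter.
    path_entries = {
        os.path.normcase(os.path.normpath(p))
        for p in env.get("PATH", "").split(os.pathsep)
        if p
    }
    for var in ("JAVA_HOME", "ANT_HOME"):
        home = env.get(var)
        if not home:
            problems.append(f"{var} is not set")
            continue
        bin_dir = os.path.normcase(os.path.normpath(os.path.join(home, "bin")))
        if bin_dir not in path_entries:
            problems.append(f"{var}/bin is not on PATH")
    return problems


if __name__ == "__main__":
    # Check the environment of the current process.
    for problem in check_build_env(os.environ) or ["environment looks fine"]:
        print(problem)
```

An empty list means both variables are defined and both bin directories are on the PATH.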

Anyone can download Minorthird using either SourceForge or CVS. Downloading from SourceForge requires less setup and is a little easier to get working. However, the version of Minorthird on SourceForge is only updated every one to two months and must be downloaded completely each time you want to update. I recommend the SourceForge version to people who do not have time for setup, are not familiar with CVS, and are not interested in updating every week. CVS is good for people who already understand CVS, want to update frequently, and are possibly interested in submitting some changes.

    Downloading from sourceforge:

    1. Go to: http://sourceforge.net/projects/minorthird/

    2. If you would like to look at all the source code, click on the regular minorthird

    source tree. For faster downloading and set up, click on minorthird-jar.

    Important: you will not be able to view source code or make changes if you only

    download the jar.

    3. Click on either the zip file or the jar file that you would like to download.

    Downloading from CVS:

    Note: If you would like to submit changes to Minorthird, email William Cohen, wcohen@cs.cmu.edu, and Cammie Williams, cammie@cmu.edu, to get a personal account on raff.

    1. Download CVS. We suggest using TortoiseCVS, which can be obtained at http://www.tortoisecvs.org/. (If you are using Cygwin, make sure that CVS is on your path and that ssh is installed. Note: ssh is not installed under the default packages in the Cygwin setup, so it is suggested that you install all Cygwin packages.)

    2. Move to the directory you would like minorthird in

    3. Set these environment variables:

    a. CVS_RSH=ssh

    b. CVSROOT=:pserver:anonymous@raff.cald.cs.cmu.edu:/usr1/cvsroot

    4. Type these commands:

    a. cvs login

    b. cvs checkout minorthird

    c. cvs update -dP (the -dP is needed to remove deleted-from-cvs files and directories)

    5. Alternatively you can type:

    a. export CVS_RSH=ssh

    b. cvs -d :ext:anonymous@raff.cald.cs.cmu.edu:/usr1/cvsroot checkout minorthird

    6. The minorthird source tree should appear in your current directory.

    Running Minorthird

    1. Set the MINORTHIRD variable to the location of your minorthird directory

    2. cd to $MINORTHIRD

    3. Run the setup script, which sets your classpath:

    a. In Windows type: script\setup

    b. In Cygwin type: source script/setup.sh

    c. In Linux type: source script/setup.linux

    4. To compile the code type:

    a. % ant build-clean

    5. To compile the javadocs type:

    a. % ant javadoc

    6. To run the tests to make sure things are working type:

    a. % ant tests

    Part II: The Basics and Labels

MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. While there are several packages for using Minorthird, in this tutorial we are first going to explore the UI package.

Find out what you can do in the UI package by typing:

    % java -Xmx500M edu.cmu.minorthird.ui.Help

This command will print a list of the programs you can run using Minorthird. To get

    more detail about each program and its parameters, you can type:

    % java -Xmx500M edu.cmu.minorthird.ui.[programName] -help

Every program you can run in Minorthird requires a dataset. Datasets can contain

    labels, which can be stored in either a labels document or embedded in the document in

    the form of XML tags. Labels are applied to spans, which are series of adjacent tokens

    that can range in size from a single token to an entire document.

Example of text labels in a labels document:

    addToType afp+20040521+germany_internet_spam 184 3 extracted_date

    addToType afp+20040521+germany_internet_spam 189 11 extracted_date

    addToType afp+20040525+stocks_us_ge_genworth 205 3 extracted_date

In a labels document the second word is the document name, the third word is the starting

    token of the span, the fourth word is the length of the span, and the last word is the label.
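The field layout described above can be made concrete with a short parser. This is a minimal Python sketch, not Minorthird code: the names LabelLine and parse_label_line are invented for illustration, and it only handles five-word lines like the addToType examples shown here.

```python
from typing import NamedTuple


class LabelLine(NamedTuple):
    operation: str  # first word, e.g. "addToType"
    document: str   # second word: the document name
    start: int      # third word: where the span starts
    length: int     # fourth word: the length of the span
    label: str      # last word: the label


def parse_label_line(line):
    """Split one five-word line of a labels document into named fields."""
    words = line.split()
    if len(words) != 5:
        raise ValueError(f"expected 5 words, got {len(words)}: {line!r}")
    operation, document, start, length, label = words
    return LabelLine(operation, document, int(start), int(length), label)


example = "addToType afp+20040521+germany_internet_spam 184 3 extracted_date"
parsed = parse_label_line(example)
print(parsed.document, parsed.start, parsed.length, parsed.label)
```

Lines with any other number of words raise a ValueError rather than being guessed at.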

Example of a document with embedded labels:

    German gov't probing surge in spam e-mails (AFP)

    http://us.rd.yahoo.com/dailynews/rss/spam/*http://story.news.yahoo.com/news?tmpl=story2&u=/afp/20040521/tc_afp/germany_internet_spam

    Fri, 21 May 2004 13:38:17 GMT

In this type of document a labeled span lies between the <label> and </label> markers. The label is the word inside the angle brackets.
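To illustrate how a label wraps a span, here is a minimal Python sketch that pulls tagged spans out of a string. It is not how Minorthird reads its data: the tag name title and the sample string are assumptions made for the example, and nested or overlapping tags are not handled.

```python
import re

# Matches <label>span text</label>. The backreference \1 requires the closing
# marker to carry the same label as the opening one.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def embedded_labels(text):
    """Return (label, span_text) pairs for every tagged span in `text`."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(text)]


# Hypothetical tagged line; the "title" tag name is an assumption.
sample = "<title>German gov't probing surge in spam e-mails (AFP)</title>"
print(embedded_labels(sample))
```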

To get a better feel for what is happening in Minorthird, you can look at the javadocs. To construct the javadocs type: % ant javadoc. An older version of the docs is also available on William's website or at http://minorthird.sourceforge.net/

Notes:

    - What you are able to mark up is defined by the TextLabels API.

    - spanTypes are the "most central" construct; tokenProps and others are not as well supported (saving, viewing, etc.).

    - The metaphor is that the toolkit is a programming language for annotations.

    o As a programming language, it needs subroutines and libraries. These are invoked with textLabels.require(), return output by adding annotations, and report completion/status information through textLabels.isAnnotatedBy().

We are going to use the sample dataset small-newsdir for this tutorial. This dataset is labeled with XML tags. It can be found on William's website (http://wcohen.com) under Teaching, Slides, notes, and materials for day 1. To look at the documents and labels in the small-newsdir dataset type:

    % java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir

After a few seconds, a window like this will appear:

You can expand the window by clicking and dragging a corner, and you can expand or

    shrink a section by clicking and dragging the section dividers. To see how each

    document has been labeled, click on the SpanTypes tab in the upper right section of the

    window. Now the upper right section of the window will look like this:

    Click on the select color- menu to select a color for

    highlighting and click on the select type- menu to choose a

    label you would like to highlight. Make sure to highlight

    different labels in different colors. Click the Apply button at

    the bottom of the section to view this highlighting.

To view an entire article, click on the article in the upper left section. The article will

    appear in the lower section. To see the complete text of an article choose the text tab in

    the bottom window, and to see the tokens of the article click on the tokens tab.

Part III: Mixup

Minorthird's language for manipulating text is mixup (Minorthird Information eXtraction and Understanding Program). Sample mixup programs can be found on William's website (http://wcohen.com) under Teaching, Slides, notes, and sample files from the first day's lecture. Here is a sample program (sample1.mixup) with commentary:

defSpanType source1 = title: ... '(' [ ... ] ')' ;

In this line of mixup, defSpanType source1 defines source1 as the spanType which is

    defined to the right of the equal sign. The expression to the right of the equal sign

    defines the pattern where source1 can be identified. This line expresses that source1 is in

    the title between the parentheses. Here is a list of what each part of the expression means:

    defSpanType - keyword

    source1 - name of the defined spanType

    title: - start with title and match it to the pattern defined in the remainder of the expression

    ... - anything

    '(' - the left parenthesis token

    [ - START

    ... - anything

    ] - END

    ')' - the right parenthesis token
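For readers who know regular expressions, the source1 pattern can be loosely compared to a regex. The Python sketch below is only an analogy: it works on characters rather than tokens, returns a single match, and the function name source1_analogue is invented for illustration.

```python
import re


def source1_analogue(title):
    """Return the text inside the final parentheses of `title`, or None."""
    # .*    plays the role of the leading "..."
    # \( \) are the literal parenthesis tokens
    # (.*)  is the bracketed "[ ... ]" part that becomes the span
    match = re.fullmatch(r".*\((.*)\)", title)
    return match.group(1) if match else None


print(source1_analogue("German gov't probing surge in spam e-mails (AFP)"))  # AFP
```

A title that does not end with a parenthesized group gives None, which mirrors the pattern requiring the right parenthesis to come last.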

defSpanType source2 = description: [ !'-'+R ] '-' ... ;

This line of mixup is very similar to the line above, but contains a few new expressions:

    ! - not this token

    + - 1+ times

    R - Extend to the right
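Read with the three symbols above, the source2 pattern picks out the leading run of tokens that are not '-', extended to the right as far as it will go, and followed by a '-' token. The Python sketch below follows that reading only; it is not the mixup implementation, the description text is made up, and splitting on whitespace is a simplifying assumption about tokenization.

```python
def source2_analogue(tokens):
    """Return the leading run of non-'-' tokens that is followed by a '-' token.

    Returns None when the first token is '-' or when no '-' follows the run.
    """
    end = 0
    # Extend to the right for as long as the token is not '-'.
    while end < len(tokens) and tokens[end] != "-":
        end += 1
    if end == 0 or end == len(tokens):
        return None
    return tokens[:end]


# Made-up description text, split on whitespace for simplicity.
description = "AFP - German authorities are investigating a surge in spam".split()
print(source2_analogue(description))  # ['AFP']
```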

To see the parameters for running a mixup program type:

    % java -Xmx500M edu.cmu.minorthird.ui.RunMixup -help

Now let's try running a sample mixup program. To do this, make sure the sample programs are in your minorthird/lib/mixup directory. Try:

    % java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -showResult

    - The -showResult parameter will graphically display the output

    OR

    % java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -gui

    - Press the “Start Task” button to run the program

When the program is done running a window like this will appear:

This window looks similar to the one that appeared when you ran ViewLabels; however,

    you will notice that there are now 6 span types rather than 4 since sample1.mixup defined

    two more span types: source1 and source2. To see what the mixup program extracted, try

    going to the SpanTypes tab and highlighting source1 and source2.

Sample1a.mixup demonstrates what happens if a mixup expression contains + instead of

    +R. Unlike other languages which extend patterns greedily, mixup takes each pattern

    literally and backtracks as needed. To see how this works run:

    % java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1a.mixup -showResult -saveAs foo.labels

Note: -saveAs FILE saves the result to FILE in a computer-readable format, and it works for most ui programs.

When the window appears, highlight the source2 spans. Knowing that source2 is any prefix that ends before a '-', you can see how this does not work right. Now try running sample1.mixup again and see how it does work right with the +R rather than just the +.

The lessons from these two sample mixup programs are:

    1) Use the L and R modifiers (as in +R) for expressions that can match, when you can

    2) Use non-determinism when you need to

    a. Ex: defSpanType bigram =description: ... [any any ] ... ;
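The bigram example relies on non-determinism: the pattern is allowed to match at every position, so every pair of adjacent tokens becomes a span. A rough Python equivalent, offered only as an illustration with an invented function name and whitespace tokenization assumed, looks like this:

```python
def all_bigrams(tokens):
    """Return every pair of adjacent tokens, one per possible match position."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


# Made-up description text, split on whitespace for simplicity.
description = "surge in spam e-mails".split()
for bigram in all_bigrams(description):
    print(bigram)
```

A description of n tokens yields n - 1 bigrams, which is why a single pattern can produce many spans per document.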

    Another example is sample2.mixup. Take a look at it, then run:

    % java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample2.mixup -showResult

Now let's take a look at some annotators:

    1) Open sample3.mixup (don't look at it yet)

    2) Run: java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample3.mixup -showResult

    a. This will take a while…

    3) Now take a look at sample3.mixup

    a. 'require' asks for some type of annotation

    b. Annotators are usually found in $MINORTHIRD/lib/mixup

    c. Annotators can be re-defined in “annotators.config”, which is usually in $MINORTHIRD/config/annotators.config

    4) When RunMixup is finished running, we will save the computation to save time

    later on. To do this, click the SaveAs button at the bottom middle of the top left

    window (you will have to scroll to get there). Note: File->SaveAs does not work

    in this case; it is only for serializable objects.

    5) Now pick out some useful tags and save them in small-newsdir.labels

    % perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels | grep addToType | cut -d" " -f5 | sort | uniq -c

    % perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels > small-newsdir.labels

    % java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir

    Note: Given -labels FOO, the labels are found as follows: (1) look in the repository, (2) look for a directory FOO, (3) look for FOO.labels for markup, and ignore in-line markup.
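If perl is not available, the filtering and counting in step 5 can be approximated in Python. This is a sketch under assumptions: the function names are invented, fields are split on any whitespace, and the fourth sample line with the label VBZ is made up to show a line being dropped.

```python
import re
from collections import Counter

# Same alternatives as the perl pattern in step 5.
WANTED = re.compile(
    r"(description|year|title|source|pubDate|link|extracted|contentArea"
    r"|body|NP|Name|NNP)"
)


def keep_useful(lines):
    """Keep lines whose fifth whitespace-separated field matches WANTED."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) > 4 and WANTED.search(fields[4]):
            kept.append(line)
    return kept


def count_types(lines):
    """Count the fifth field of every addToType line, like sort | uniq -c."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "addToType" in line and len(fields) > 4:
            counts[fields[4]] += 1
    return counts


lines = [
    "addToType afp+20040521+germany_internet_spam 184 3 extracted_date",
    "addToType afp+20040521+germany_internet_spam 189 11 extracted_date",
    "addToType afp+20040525+stocks_us_ge_genworth 205 3 extracted_date",
    "addToType afp+20040521+germany_internet_spam 0 2 VBZ",  # made-up line
]
kept = keep_useful(lines)
print(count_types(kept))
```

Writing the kept lines to small-newsdir.labels would then correspond to the second perl command.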

Part IV: The Mixup Debugger and LabelEditor

Debugging Mixup gives you the ability to edit your labels and your labeling program in

    parallel. To see how this works, copy saved-handLabeled.labels to handLabeled.labels

    and try:

% java -Xmx500M edu.cmu.minorthird.ui.DebugMixup -labels small-newsdir -edit handLabeled.labels -mixup sample5.mixup

A window that looks like this will appear (without the highlighting at first)

To highlight extracted companies (which were defined by the mixup program), select

    extracted_company from the first pull down menu on the section divider. All the

    extracted companies will turn yellow (you may have to scroll down a little to find any.)

    Then to view the true companies, which were defined by handLabeled.labels, select

    true_company from the second pull down menu. All hand labeled companies that were

    properly extracted by the mixup program will turn green, all companies that were missed

    by the mixup program will turn blue, and false positives will turn red. (See above picture

    for reference.)

To edit the labels, click on a document, and click the Import button at the bottom of the

    window. This will import all the extracted company labels. To correct these labels click

    the Next button and Delete if it is a false positive. To add a label, highlight the span and

    click Add. When you are finished labeling a document, click Export. Click Save when you finish.

Some Tricks:

    1) On RHS of the center bar, replace -top- with -body- to focus the window to what

    you care about.

    2) Replace -top- with -extracted company- and move the slide to look for

    extractions-in-context.

When you're close enough with the debugging, you might want to hand

    the task over to someone else to get more training data. First run the

    current program:

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup

    sample5.mixup -saveAs sample5.labels

Now take the relevant part of its output, and your hand-labeling results,

    and merge them:

% grep extracted_company sample5.labels > labelingTask.labels

    % cat handLabeled.labels >> labelingTask.labels

Now run the labelling tool (which is somewhat stripped down, probably

    not enough), on the result:

% java -Xmx500M edu.cmu.minorthird.ui.EditLabels -labels small-newsdir -edit

    labelingTask.labels -extractedType extracted_company -trueType true_company

Part V: Extraction Learning

Now that you are done labeling some data, you can try running an extraction experiment with the data that you have created. You can view the labels that you have created by concatenating all the labels you have saved into the file small-newsdir.labels and running:

    % java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir

    Note: If Minorthird is not loading the labels file, check that the name of your labels file is exactly the name of the data directory with .labels appended to the end. Also, make sure that your labels file is in the same directory as your data directory.
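The naming rule in the note can be checked mechanically. The following Python sketch is an illustration only; the function names are invented, and it encodes nothing beyond the rule stated above (same parent directory, directory name plus .labels).

```python
from pathlib import Path


def expected_labels_file(data_dir):
    """Return the path the naming rule implies: <data_dir>.labels beside it."""
    data_dir = Path(data_dir)
    return data_dir.with_name(data_dir.name + ".labels")


def labels_file_problem(data_dir):
    """Return a message if the data directory or its labels file is missing."""
    data_dir = Path(data_dir)
    labels = expected_labels_file(data_dir)
    if not data_dir.is_dir():
        return f"data directory not found: {data_dir}"
    if not labels.is_file():
        return f"labels file not found: {labels}"
    return None


print(expected_labels_file("small-newsdir"))  # small-newsdir.labels
```

Calling labels_file_problem("small-newsdir") from the directory that holds the dataset returns None when both the directory and its labels file are in place.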

This should look a lot like what was in the label editor. Now let's try learning from the data:

    % java -Xmx500M edu.cmu.minorthird.ui.TrainTestExtractor -gui

    Note: Almost all the ui tools support the -gui option, which allows you to explore what is possible to do. However, this graphical user interface is not yet ideal to use.

There are a lot of defaults for a TrainTestExtractor experiment, but you need to at least specify where the data is and what you want to learn. You can do this in the gui as follows:

    - Follow: Edit -> baseParameters -> labelFilename, hit Browse, and select the small-newsdir directory

    - Close all these windows with “OK”

    - Now that you have specified the data, you can click the Show labels button, which should bring up a window similar to the one you saw when you ran ViewLabels. Verify that the data is correct.

    - Follow: Edit -> signalParameters -> spanType and select true_company (what to learn)

    - Follow: Edit -> splitterParameters and check “showTestDetails” for more output for debugging

    - Follow: Edit -> trainingParameters -> learner and check displayDatasetBeforeLearning

    - Hit the “Start Task” button… the evaluation will take a minute or two

The SequentialDataset window shows what Minorthird reduces the extraction problem to: POS/NEG examples for each word, depending on whether or not it's in the selected spanType.
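The reduction described here can be pictured with a small example. The Python sketch below follows only the description above (one POS or NEG example per word); it is not Minorthird's implementation, the sentence is made up, and spans are given as (start token, length) pairs by assumption.

```python
def token_examples(tokens, spans):
    """Label each token POS if it lies inside any (start, length) span, else NEG."""
    inside = set()
    for start, length in spans:
        inside.update(range(start, start + length))
    return [(tok, "POS" if i in inside else "NEG") for i, tok in enumerate(tokens)]


# Made-up sentence; the span (3, 2) marks "General Electric" as the selected type.
tokens = "Shares of the General Electric unit rose".split()
for token, label in token_examples(tokens, [(3, 2)]):
    print(token, label)
```

Each word becomes one training example, so a learner that classifies words can be used to recover whole spans.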
