The purpose/feature of LIBRARY OF LIFE (I denote it as LL)

By Emily Cruz, 2014-08-28 23:51

    I. The purpose/feature of LIBRARY OF LIFE (I denote it as LL):

    0. LL is a search engine whose search space is the internet, other search engines (like Google), and all the public/private databases (like the Protein Data Bank) that are legally accessible via the internet. In principle, LL does not need its own database, but it may locally cache some frequently requested information. The “software level” of LL is higher than that of search engines or public database tools because LL sits on top of them: LL makes use of the search engines and web databases to answer a query.

1. There are so many free biological database tools like PDB, NCBI’s Entrez,

    SwissProt, etc. So why do we need to add another database tool to the list?

    @ LL will provide a unique set of services that those databases do not provide. Those databases are “low-level” in that their interfaces are fully specialized, complex, and hard to understand for non-professional users who need the data. For instance, if you open the PDB webpage, there are simply too many links that one needs to follow to get the data, and too many input slots to fill in for a query.

    LL will provide a simple interface and hide all the details. This conforms to the information-hiding principle of higher-level software. One of the reasons that Google is a more popular search engine than MSN or Yahoo is that Google has the simplest, cleanest front page (it has only the “Google” logo and one input slot; it has no list of links or fancy pictures, as MSN or Yahoo have). LL will aim for the same simplicity.

    2. There are so many free-of-charge, simple and convenient search engines like

    Google. So why do we add another search engine to the list?

    @ If database tools like PDB and Entrez are too complex, Google is too simple. Its search results are not organized, so users of Google need to spend “extra” time and effort going through several search results, refining them, and then extracting the information they need. LL does all this “extra” work for the users.

     The biggest problem with Google is its secularity, or non-scholasticity. If you type “elephant” into Google, what you get on the first page is movies or music albums named “elephant.”

3. Who provides the money to maintain LL?

@ Advertisers, such as drug companies.

II. Mechanism of LL

    1. LL’s front webpage (i.e. …) is minimally simple: it has only the “Library of Life” logo and one input slot.

2. A user types “elephant” and hits enter.

    3. LL recognizes “elephant” as “species” and displays the following choices:

    global distribution of the elephant,

    pictures of the elephant,

    sounds of the elephant,

    encyclopedic explanation about the elephant,

    genome of the elephant,

    proteome of the elephant (the list of proteins in an elephant’s body).

4. The user clicks on “pictures of the elephant.”

5. LL internally asks Google about “elephant picture”, refines Google’s results, and returns them in ranked order.

Other examples would be:

    User asks “Charles Darwin”. LL recognizes it as “biologist” and displays the choices: biography, contributions to biology, books written by him.

    User asks “protein kinase”. LL recognizes it as “terminology” and displays the choices: definition, related journals, popular websites.

    User asks “hemoglobin B”. LL recognizes it as “biomolecule” and displays the choices: 3-dimensional structure, species that have this molecule in their body.

III. Structure of LL

    1. LL has 3 main components: conceptual categories, attribute sets, and the master dictionary.


2. All the words that are used in a biological context are classified into “conceptual categories” (denoted as CC). E.g.

    Conceptual category 1: species

    Conceptual category 2: terminology

    Conceptual category 3: biomolecule

3. Each conceptual category has an attribute set. E.g.

    CC 1 (species) attribute set: global distribution of the species, pictures of the species, sounds of the species, encyclopedic explanation about the species, its genome, its proteome.

    CC 2 (terminology) attribute set: definition of the term, journals related to the term, popular websites related to the term.

    CC 3 (biomolecule) attribute set: 3D structure, chemical formula, related journals.

    4. The master dictionary is a hash table whose entries are (word, conceptual category) pairs. E.g. (elephant, species), (evolution, terminology), (Charles Darwin, biologist).

5. When LL gets the query “elephant”, LL computes the hash value of “elephant” and looks it up in the master dictionary to find out which conceptual category it belongs to.

    6. Once LL notices that elephant belongs to the species category, LL displays the

    attribute set of the species category:

    global distribution of the species, pictures of the species, sounds of the species,

    encyclopedic explanation about the species, genome, proteome.

7. According to the user’s choice of attribute, LL consults the appropriate search engine or web database tool. E.g. if the user chooses “genome,” LL consults NCBI’s
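
The lookup described in steps 4 to 7 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: a plain dict stands in for the hash table, the entries are the examples from these notes, and the names `lookup_choices`, `choose_backend`, `BACKENDS`, and `DEFAULT_BACKEND` are hypothetical. Only the two routings hinted at in the text (genome to NCBI, pictures to Google) are listed; falling back to Google for everything else is an assumption.

```python
# Master dictionary: (word, conceptual category) entries from the text.
# Keys are lower-cased so that "Elephant" and "elephant" match.
MASTER_DICTIONARY = {
    "elephant": "species",
    "evolution": "terminology",
    "charles darwin": "biologist",
}

# One attribute set per conceptual category, as listed in the notes.
ATTRIBUTE_SETS = {
    "species": [
        "global distribution", "pictures", "sounds",
        "encyclopedic explanation", "genome", "proteome",
    ],
    "terminology": ["definition", "related journals", "popular websites"],
    "biomolecule": ["3D structure", "chemical formula", "related journals"],
    "biologist": ["biography", "contributions to biology", "books"],
}

# Assumption: only these two routings are suggested by the text.
BACKENDS = {"genome": "NCBI", "pictures": "Google"}
DEFAULT_BACKEND = "Google"  # assumption, not stated in the notes


def lookup_choices(query):
    """Return (conceptual category, attribute set) for a query, or None."""
    # A Python dict is a hash table, so .get() is the hash lookup of step 5.
    category = MASTER_DICTIONARY.get(query.strip().lower())
    if category is None:
        return None
    # Step 6: the choices shown to the user are the category's attribute set.
    return category, ATTRIBUTE_SETS.get(category, [])


def choose_backend(attribute):
    """Step 7: pick the search engine or database tool for an attribute."""
    return BACKENDS.get(attribute, DEFAULT_BACKEND)


category, choices = lookup_choices("Elephant")
print(category, choices)
print(choose_backend("genome"))
```

A query that is not in the dictionary returns `None`, which leaves room for a fallback such as a plain web search.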


    IV. Query with multiple words

    1. Assume the user input is several words, like “genotype’s relationship with


    2. LL gets rid of words like “relationship” or “with,” since they are not biological terms and hence not in the Master Dictionary (which is a hash table with fast lookup for entry existence).

    3. The dictionary entry has one more field, “domain,” such as the genetics domain, bioinformatics domain, or biophysics domain. E.g. some possible dictionary entries are (elephant, species, zoology), (protein folding, terminology, biophysics).

    4. LL maintains an attribute set for each domain. For the genetics domain, the attribute set is: …

    5. LL identifies the domain of each input word.

    6. If most of the words’ domains are the same, say 4 out of 5 words’ domains are biophysics, take that domain as the representative domain of the query. Then in

    7. Otherwise, consider each domain as representative.
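
The multi-word procedure in steps 2 to 7 might look roughly like the following Python sketch. The dictionary entries other than “elephant” are made up for illustration, the helper names `normalize` and `representative_domains` are hypothetical, and reading “most” as “more than half of the recognized words” (the `threshold` parameter) is an assumption; the notes only give the 4-out-of-5 example.

```python
from collections import Counter

# Hypothetical entries: word -> (conceptual category, domain).
MASTER_DICTIONARY = {
    "elephant": ("species", "zoology"),
    "genotype": ("terminology", "genetics"),
    "phenotype": ("terminology", "genetics"),
    "allele": ("terminology", "genetics"),
    "folding": ("terminology", "biophysics"),
}


def normalize(word):
    """Lower-case a word and drop punctuation and a trailing possessive."""
    word = word.strip(".,?!\"“”").lower()
    for suffix in ("'s", "’s"):
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    return word


def representative_domains(query, threshold=0.5):
    """Return the domain(s) that represent a multi-word query."""
    words = [normalize(w) for w in query.split()]
    # Step 2: words that are not in the master dictionary are dropped.
    known = [w for w in words if w in MASTER_DICTIONARY]
    if not known:
        return []
    # Step 5: identify the domain of each remaining word.
    counts = Counter(MASTER_DICTIONARY[w][1] for w in known)
    domain, hits = counts.most_common(1)[0]
    # Step 6: if one domain covers most words, it represents the query.
    if hits / len(known) > threshold:
        return [domain]
    # Step 7: otherwise every domain found is treated as representative.
    return sorted(counts)


print(representative_domains("genotype's relationship with phenotype"))
print(representative_domains("elephant genotype"))
```

The first query collapses to a single domain because both recognized words are genetics terms; the second has no majority, so both domains are returned and the user would be offered the attribute sets of each.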

V. Extension & Discussion

    LL’s core concepts, such as the master dictionary and attribute sets, can be applied to other areas to build a Library of Business, Library of Social Science, Library of Art and Music, etc. Or we can build a Universal Master dictionary whose entries are (word, cc, as, AREA). Upon a query, first decide the area of the query words and consider only that area’s DB.

    The biggest difference between Library of X and search engines like Google is that Library of X is designed to facilitate non-secular, scholastic queries. In Google, if you enter a common biological term like “aging,” the first things you get are overly secular results such as anti-aging cream. More specialized terms like “protein kinase” do give scholastic results in Google, but they are not organized at all, and the user must visit several sites to see whether they have the needed information. Since, from a scholarly point of view, the ranking is almost random, it is probable that the site with the information the user needs is ranked low, and the user simply dismisses it for that reason. LL solves these two problems by filtering Google’s results to keep only “scholarly” sites, and by categorizing the results into the displayed choices so that the user can more easily find the information they need.
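
As a rough sketch of the filtering and categorizing idea, the Python below keeps only results from hosts that look scholarly and then sorts them under the displayed choices. Everything specific here is an assumption made for illustration: the notes do not say how a “scholarly” site is recognized, so the host-suffix list, the keyword cues, the function names, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Assumption: a site counts as "scholarly" if its host ends with one of these.
SCHOLARLY_SUFFIXES = (".edu", ".gov", ".ac.uk")

# Assumption: crude keyword cues for sorting results under displayed choices.
CHOICE_KEYWORDS = {
    "definition": ("definition", "glossary"),
    "related journals": ("journal",),
}


def is_scholarly(url):
    """Return True if the URL's host matches the scholarly suffix list."""
    host = urlparse(url).hostname or ""
    return host.endswith(SCHOLARLY_SUFFIXES)


def organize(results):
    """Filter (url, title) pairs, then group the URLs by displayed choice."""
    grouped = {choice: [] for choice in CHOICE_KEYWORDS}
    grouped["other"] = []
    for url, title in results:
        if not is_scholarly(url):
            continue  # drop "secular" results
        for choice, keywords in CHOICE_KEYWORDS.items():
            if any(k in title.lower() for k in keywords):
                grouped[choice].append(url)
                break
        else:
            # No keyword matched: keep the result, but uncategorized.
            grouped["other"].append(url)
    return grouped


# Made-up search results, standing in for what a search engine returns.
results = [
    ("https://bio.example.edu/kinase", "Protein kinase definition"),
    ("https://shop.example.com/cream", "Anti-aging cream"),
    ("https://www.example.gov/j", "Journal of Kinases"),
]
print(organize(results))
```

A real system would need a far better notion of “scholarly” than a suffix list; the point is only that filtering and grouping happen on top of the search engine’s output, without LL keeping its own database.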

Comments on pdf presentation


    I disagree with the hierarchical organization: it looks logical and beautiful, but in practice it is unnecessary and useless. I would suggest dividing the DB by areas such as genetics, zoology, ecology, etc. A Genetics_Organism relation belongs to the Genetics area: Genetics_Organism(oid, gid, pid), where gid is the gene id and pid is the proteome id. In the Ecology area, we have the relation:

    ecology_organism(oid, location, population).

    Most queries do not cross areas; most queries combine data within one area. E.g. the query “get groups of species that share 10% of their genome” can be answered using relations within the Genetics area.
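
To make the point about staying within one area concrete, here is a small sketch using Python’s built-in sqlite3 module. The column types, the sample rows, and the reading of “share 10% of their genome” as “the shared genes are at least 10% of the first organism’s gene count” are all assumptions; the notes give only the relation names and columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Genetics area
    CREATE TABLE Genetics_Organism (oid INTEGER, gid INTEGER, pid INTEGER);
    -- Ecology area (not touched by the query further down)
    CREATE TABLE ecology_organism (oid INTEGER, location TEXT, population INTEGER);
""")

# Made-up data: organism 1 has genes 1..10, organism 2 has gene 1 plus
# genes 11..19, and organism 3 has genes 20..29 (nothing in common).
rows = [(1, g, 100) for g in range(1, 11)]
rows += [(2, g, 200) for g in [1] + list(range(11, 20))]
rows += [(3, g, 300) for g in range(20, 30)]
conn.executemany("INSERT INTO Genetics_Organism VALUES (?, ?, ?)", rows)

# Pairs of organisms whose shared genes make up at least 10% of the first
# organism's genes. Only the Genetics-area relation is needed.
SHARED_GENOME = """
    SELECT a.oid, b.oid
    FROM Genetics_Organism a
    JOIN Genetics_Organism b ON a.gid = b.gid AND a.oid < b.oid
    GROUP BY a.oid, b.oid
    HAVING COUNT(*) * 10 >=
           (SELECT COUNT(*) FROM Genetics_Organism g WHERE g.oid = a.oid)
"""
pairs = conn.execute(SHARED_GENOME).fetchall()
print(pairs)
```

Organisms 1 and 2 share one gene out of ten, so they form the only qualifying pair; the ecology relation plays no part in answering the query.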


    Keep track of personal preferences, just like Amazon does: a list of “you may be interested in these books/equipment” or “you may be interested in this paper/theory/journal/data.”

    Also, LL provides a “collaborator suggestion” service, “you may want to ask / be interested in this group/person/institution,” to those members who subscribe to this service. It also provides “forming groups,” like Yahoo! Groups does.


     Build separate tables for different organisms

     Bacteria have different properties

     Canines have different properties (e.g., height)

    @ Limit the tables to gene info: Orgene(oid, gid), with the query SELECT R1.oid, R2.oid FROM Orgene R1, Orgene R2 WHERE R1.oid != R2.oid AND R1.gid = R2.gid. Partition the relations by genus. Size goes in another relation, (oid, size), or in (oid, gid, size).
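
The self-join over Orgene(oid, gid) in this comment can be tried out with Python’s built-in sqlite3 module. The sample rows are made up, and the DISTINCT and ORDER BY clauses are additions for readable output rather than part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orgene (oid INTEGER, gid INTEGER)")
# Made-up data: organisms 1 and 2 both carry gene 10; organism 3 does not.
conn.executemany(
    "INSERT INTO Orgene VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 11)],
)

# Self-join: pairs of different organisms that have a gene in common.
query = """
    SELECT DISTINCT R1.oid, R2.oid
    FROM Orgene R1, Orgene R2
    WHERE R1.oid != R2.oid AND R1.gid = R2.gid
    ORDER BY R1.oid, R2.oid
"""
shared = conn.execute(query).fetchall()
print(shared)
```

Because the condition is `!=`, each pair comes back in both orders; replacing it with `R1.oid < R2.oid` would report every pair once.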

     Devise new family of relational schemas that can cope with diversity

    @ Totally meaningless! In a biological context, you only need gene info. For the height of a dog, make and use a different database. In practice, no one ever correlates gene info with height. This is not a market demand.


     How do we build query support for exploratory queries?

    @ The DB need not answer this. Simply google “gene, phenotype.” Or, in a more-than-Google LL, partition words into domains (genetics, bioinformatics). Given multiple words, let LL guess which domain they belong to. In the master dictionary, add another field: (word, conceptual category, domain). If the input is multiple words, ignore the CC and consider only the domain. For each domain, make an attribute set (AS), and offer the user the choice.


     Is gene production (usually measured in microarrays) related to environmental

    conditions (temperature, drought, season, shade, radiation, pollution, etc.)? What

    are the most important environmental parameters?

    @ You mean gene expression. You already have a microarray DB. It is easier to use both the user’s ability and the DB’s ability. Don’t let the DB do all the work. The DB only provides tools to facilitate the user’s research. You cannot tailor the DB to every possible user’s research.


     Given high correlation of gene expression in a class of genes. Are environmental

    conditions affecting the correlation (rather than the average amplitude of

    productions between individual genes)?

    @ Unnecessary and too specialized query types. The DB should be limited to providing only general, basic info. It is the user’s responsibility to combine it further and tailor it to the purpose of their research.
