
User:Charles R Greathouse IV/Metadata

From OeisWiki


This is a page for my thoughts about metadata (data about data) in the OEIS. In all cases the basic idea is to take some recurrent feature and expose it in some way so that it can be searched for, rendered differently, user-customized, etc.

See also Features Wishlist#Sequence Metadata for requested features and discussion.


Current state

There is a great deal of metadata in the OEIS at present. The keywords are a major part: for example, keyword:tabl allowed the addition of a table output format (see, e.g., A007318/table) and similarly with keyword:cons (e.g., A000796/constant).

Contributors to the OEIS are now tagged with underscores which cause them to be auto-linked. I do not believe it is possible to search for them (other than as plain text, as usual) but this mechanism should allow better automated parsing.

Sequence properties

See User:Charles R Greathouse IV/Properties for thoughts on properties and their relationships.

Keywords

See User:Charles R Greathouse IV/Keywords for information on individual keywords.

Keywords have been the primary form of metadata for the OEIS since its creation. In its current incarnation keywords can be searched and have title-text allowing new users to understand their meanings more readily (though many are intuitive even without this hovertext).

Index

Because there are so few keywords, the Index is the fallback method for collecting similar sequences together. Unfortunately, in its present implementation:

  • The meaning of inclusion is not well-specified. This is appropriate and useful for an index, but limits its utility when using it for other purposes. For example, quadratic form primes by discriminant: would a sequence *about* but not *of* primes of that discriminant be included? As an index, it would be useful to include such other sequences but in other contexts this may be undesirable.
  • It is hard to search. The names are long and many do not get their own name/id attribute. Also there are entries with Index links that do not match the spelling of the Index entry.
  • It is relatively inflexible (new entries are rarely added).

Despite these drawbacks I strongly recommend adding Index links to entries. If this is the only way sequence properties are ever tagged then we should make the best of it. If we eventually move to a different system then the Index links can form the start of that system by some automated process.

Of course there are things that an index is good at where it should not be replaced. I see its function, ideally, as cataloging "related to" rather than "is a" relationships. A sequence "is" monotonic, a relationship better described by a keyword than by an index entry; a sequence that is related to monotonicity (say, A158939) but need not itself be monotonic is perfect for the index. (Similarly, A083140 is not actually a permutation of the natural numbers, but it should have, and does have, an index link to permutations of the natural numbers.)

Tags

One possibility would be adding a tag field to entries. This would be a more free-form version of keywords. A tag would consist of letters, numbers, and dashes; say /\b[a-z][a-z0-9]*(-[a-z0-9]+)*\b/i. They could be searched just as keywords are, but new ones could be created easily. First thought: any Associate Editor can create a new tag by editing a (protected) page by adding the name and a description. When adding a tag to a sequence the submission form checks that page to see if it exists; if not, it rejects the submission just as it does if a nonexistent keyword is added.
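A rough sketch of how the submission check might look, using the pattern above. The APPROVED_TAGS dictionary is a hypothetical stand-in for the protected page of tag names and descriptions, and the function names are mine, not anything that exists in the OEIS software.

```python
import re

# The pattern from the text; fullmatch() is used below so the whole
# string must be a tag, which makes the \b anchors unnecessary.
TAG_RE = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*", re.IGNORECASE)

# Hypothetical stand-in for the protected page: tag name -> description.
APPROVED_TAGS = {
    "abstract-algebra": "Sequences arising in abstract algebra",
    "combinatorics": "Sequences arising in combinatorics",
    "number-theory": "Sequences arising in number theory",
}


def is_well_formed(tag):
    """True if the tag is letters, digits, and single interior dashes."""
    return TAG_RE.fullmatch(tag) is not None


def check_submission(tags):
    """Return (tag, reason) pairs for rejected tags; an empty list means accept."""
    problems = []
    for tag in tags:
        if not is_well_formed(tag):
            problems.append((tag, "malformed"))
        elif tag.lower() not in APPROVED_TAGS:
            problems.append((tag, "unknown"))
    return problems
```

An unknown but well-formed tag is rejected in the same way as a malformed one, mirroring how a nonexistent keyword is handled.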

A good start would be tags for different fields of math: abstract-algebra for A000001, combinatorics for A000002, number-theory for A000003, etc. (Of course narrower tags could be added instead of or in addition: group-theory, automata, binary-forms.) Or perhaps the MSC could be used alongside or instead.

In terms of implementation there are optimizations possible (if enough submissions are done to cause significant server load). For example, every time the tag page is edited the server could update a trie of acceptable tags. But in the (likely?) case that tag lookups in sequence submissions are a small part of server load this could be skipped.
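For concreteness, here is a minimal sketch of such a trie, rebuilt from the list of tag names whenever the tag page changes. The class and method names are illustrative only; a plain set would serve equally well for exact lookups, and the trie mainly pays off if prefix queries (e.g. autocompletion) are wanted.

```python
_END = ""  # marker key; cannot collide with a one-character edge label


class TagTrie:
    """Case-insensitive trie of acceptable tags (illustrative sketch)."""

    def __init__(self, tags=()):
        self.root = {}
        for tag in tags:
            self.add(tag)

    def add(self, tag):
        node = self.root
        for ch in tag.lower():
            node = node.setdefault(ch, {})
        node[_END] = True

    def __contains__(self, tag):
        node = self.root
        for ch in tag.lower():
            node = node.get(ch)
            if node is None:
                return False
        return _END in node

    def with_prefix(self, prefix):
        """Return all stored tags starting with prefix, sorted."""
        prefix = prefix.lower()
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            for key, child in current.items():
                if key == _END:
                    found.append(text)
                else:
                    stack.append((child, text + key))
        return sorted(found)
```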

Other approaches

Besides the existing keywords, there are many properties that seem worth coding such as being monotone, completely multiplicative, additive, sub-/super-additive, or even "a rearrangement of \mathbb{N}". Also it would be good to have information on the recognizability of sequences: A038772 is a regular language in decimal, and a number of sequences (primes, 2^n-1, etc.) are regular in unary. Similarly, when there are results showing that a particular sequence is/is not context-free, context-sensitive, or decidable/recursive this seems worth mentioning. (Almost all sequences in the OEIS should be at least recursively enumerable; A004147 is one of the rare uncomputable sequences in the OEIS.)
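Some of these properties can at least be screened mechanically from the listed terms. The sketch below (function names are mine) checks a finite prefix, assuming the list starts at a(1). Such a check can only refute a property, never prove it, so a passing result would be a candidate for tagging rather than a confirmation.

```python
def is_nondecreasing(terms):
    """True if no listed term is smaller than its predecessor."""
    return all(a <= b for a, b in zip(terms, terms[1:]))


def multiplicativity_counterexample(terms):
    """Look for m <= n with a(m*n) != a(m)*a(n) among the listed terms.

    terms[0] is taken to be a(1). Returns the first (m, n) found, or None
    if the listed terms are consistent with complete multiplicativity.
    """
    count = len(terms)

    def a(k):
        return terms[k - 1]

    for m in range(1, count + 1):
        # Only products m*n that still fall inside the listed terms.
        for n in range(m, count // m + 1):
            if a(m * n) != a(m) * a(n):
                return (m, n)
    return None
```

For example, the squares pass on any prefix, while the sum-of-divisors values 1, 3, 4, 7, 6, 12, 8, 15 fail at (2, 2) because a(4) = 7 but a(2)^2 = 9.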

I would also very much like to be able to classify the growth rates of the monotone sequences; this could lend itself to searching very well. I'm not sure what the best way to do this is; some system where the types are meaningful rather than just text would be ideal, so that adding more information would not detract from the entry.

I would also like to be able to mark sequences and sequence properties which are dubious (guessed / open / conjectured) or simply not rigorously proven yet. I prefer, ceteris paribus, to define sequences without reference to conjecture. For example, A059784 could be defined as either \lfloor k^{2^n}\rfloor or as a(n+1)=nextprime(a(n)^2). The former relies on the existence of such k ('obvious' but unproven) while the latter exists unconditionally.
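The unconditional definition is also the one that is directly computable. A small sketch, assuming the starting value a(1) = 2 (not stated above) and using naive trial division, which is only practical for the first handful of terms:

```python
def is_prime(n):
    """Trial division; adequate only for small n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def squared_nextprime_terms(count, start=2):
    """First `count` terms of a(n+1) = nextprime(a(n)^2), with a(1) = start."""
    terms = [start]
    while len(terms) < count:
        terms.append(next_prime(terms[-1] ** 2))
    return terms[:count]
```

With the assumed start, squared_nextprime_terms(4) gives [2, 5, 29, 853]. No value of k is needed anywhere in the computation.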

Many sequences have their generating function listed in a standardized form. I'd love to be able to search for sequences by properties implied by these generating functions, like sequences with exponential growth.

Finally, there are many natural equivalence classes of sequences. It would be good to mark these somehow, probably by choosing a representative from each class. (No need for AC, since there are only finitely many sequences in the OEIS...)

Identifiers

People

OEIS contributors are now identified with their standard user name surrounded by underscores. This causes the name to be auto-linked to the user's page, and opens the door at some later point to various forms of automated processing.
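As an illustration of the kind of automated processing this allows, the sketch below pulls underscore-delimited names out of a line of text. The regular expression is my own guess at a reasonable pattern, not the one the OEIS server uses; it requires that the underscores not be glued to letters or digits, so that subscript-like text such as a_n is left alone.

```python
import re

# An underscore not preceded by a word character, a name with no
# underscores or newlines, then an underscore not followed by a word
# character. This is an assumed convention, not the server's actual rule.
SIGNATURE_RE = re.compile(
    r"(?<![A-Za-z0-9_])_([^_\s][^_\n]*?)_(?![A-Za-z0-9_])"
)


def contributors(text):
    """Return the marked contributor names in order of appearance."""
    return [match.group(1) for match in SIGNATURE_RE.finditer(text)]
```

For instance, contributors("a(n) = 2*a(n-1). - _Charles R Greathouse IV_, Jul 01 2010") returns ["Charles R Greathouse IV"].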

It would be good to find user names not marked in this way and mark them, but this is not a high priority.

Perhaps something should be done to mark the names of other people (besides just the contributors). When searching this would make it easier to find people with names that can be spelled differently (Chebyshev), names with accented characters, names that are often abbreviated, names that change (married names? personal, religious, or cultural changes?), and so forth. It should also make it possible to disambiguate common names.

Programs

The first priority with programs is to distinguish the languages in the "other language" field. This way

  • Searching is made easier (look, for example, at the number of variants used to describe Visual Basic or Scheme, or the difficulty of searching for Maxima programs)
  • There is potential to format programs with, e.g., GeSHi or SyntaxHighlighter.
  • The entry can be formatted differently, perhaps (e.g.) showing two rows with "Python" and "MAGMA" rather than one with "Program:" in the left column
  • By exposing this content, scripting the OEIS becomes easier.
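A sketch of the normalization step, assuming programs in the "other language" field begin with a parenthesized label such as "(VBA)". The alias table is hypothetical and deliberately tiny; a real one would have to be compiled from the variants actually present in the database.

```python
import re

# Hypothetical alias table: lowercased label -> canonical language name.
ALIASES = {
    "vb": "Visual Basic",
    "vba": "Visual Basic",
    "visual basic": "Visual Basic",
    "scheme": "Scheme",
    "mit scheme": "Scheme",
    "magma": "MAGMA",
    "python": "Python",
}

LABEL_RE = re.compile(r"^\(([^)]+)\)\s*(.*)$", re.DOTALL)


def split_program(line):
    """Split '(Label) code' into (canonical language, code).

    Unknown labels are returned as written; lines without a leading
    parenthesized label give (None, line). Note that code which itself
    starts with a parenthesis would be misread by this simple pattern.
    """
    match = LABEL_RE.match(line)
    if match is None:
        return (None, line)
    raw = match.group(1).strip()
    key = " ".join(raw.lower().split())  # collapse case and spacing
    return (ALIASES.get(key, raw), match.group(2))


def group_by_language(lines):
    """Collect labeled program lines into {language: [code, ...]}."""
    groups = {}
    for line in lines:
        language, code = split_program(line)
        groups.setdefault(language, []).append(code)
    return groups
```

Grouping in this way is what would make separate "Python" and "MAGMA" rows possible in the rendered entry.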

Another priority is to distinguish versions from each other. (See, for example, the issues with Maple versions between A006506 and A191779.) What runs in Mathematica 10 may not run in Mathematica 8, etc. This should support multiple versions and/or version ranges: Math'ca 6+ or Pari/GP 2.3.1–2.4.2. Ideally (but this seems more difficult) related languages could share implementations: Octave and Matlab or Excel and OOo Calc.
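To make the version-range idea concrete, here is a sketch that parses specifications of the two shapes mentioned (an open-ended "6+" and a closed "2.3.1–2.4.2") and tests whether a given version falls inside. The syntax and function names are assumptions for illustration; real version strings with letters or suffixes would need more care.

```python
def parse_version(text):
    """'2.3.1' -> (2, 3, 1). Only dotted integers are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def parse_range(spec):
    """Return (low, high); high is None for an open-ended 'X+' range."""
    spec = spec.strip()
    if spec.endswith("+"):
        return (parse_version(spec[:-1]), None)
    for dash in ("\u2013", "-"):  # en dash or plain hyphen
        if dash in spec:
            low, high = spec.split(dash, 1)
            return (parse_version(low), parse_version(high))
    version = parse_version(spec)
    return (version, version)


def _at_most(a, b):
    """a <= b after padding with zeros, so that 8 and 8.0 compare equal."""
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a <= b


def supports(spec, version):
    """True if `version` lies within the range described by `spec`."""
    low, high = parse_range(spec)
    v = parse_version(version)
    return _at_most(low, v) and (high is None or _at_most(v, high))
```

Multiple ranges for one program could then be stored as a list of such specifications and tested with any().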

A low priority would be to distinguish comments from program code so that they could be treated differently by, e.g., search.

Other

It could be useful to identify other things uniquely.

  • Languages: The ISO 639-3 codes, possibly together with IANA subtags per BCP 47, can be used in a tag's lang attribute. (It would be nice to be able to distinguish when sources are in Latin or French, for example.)
  • Journals: A journal may have several abbreviations or even several names in addition to the full form of its current name. Consider (not a great example...) "Mathematics of Computation" vs. "Math. Comp." vs. "Mathematical Tables and Other Aids to Computation" vs. however that was abbreviated.
  • Authors: Perhaps ORCiD would be useful?
  • Books: Different printings, translations, etc. (via ISBN?); distinguish books with similar or identical names; WorldCat, Amazon (AmazonSmile?), or other links; attach other relevant metadata concerning author, language, etc.

General metadata

Dates

The OEIS is in a relatively good position having standardized its date format as either mmm dd yyyy or mmm dd, yyyy. But dates are not easily searched: imagine trying to find sequences from the first half of 2010. That would require a complicated search: [1]. Worse, try searching for a comment from July 2010. You can't just search for comments with both 2010 and July, because that would match a sequence with one comment from July 2009 and another from April 2010.
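If dates were parsed rather than treated as text, both queries become simple. The sketch below recognizes the two standardized formats and checks each comment separately, which avoids the July 2009 / April 2010 false match. The function names and the sample comments in the usage note are made up for illustration.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Matches both 'mmm dd yyyy' and 'mmm dd, yyyy'.
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}),? (\d{4})\b")


def dates_in(text):
    """Return all valid dates found in one piece of text."""
    found = []
    for month, day, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month], int(day)))
        except ValueError:
            pass  # skip impossible dates such as 'Feb 31 2010'
    return found


def has_comment_between(comments, start, end):
    """True if some single comment carries a date in [start, end]."""
    return any(start <= d <= end for c in comments for d in dates_in(c))


def has_comment_in_month(comments, year, month):
    """True if some single comment is dated in the given month."""
    return any(
        d.year == year and d.month == month
        for c in comments
        for d in dates_in(c)
    )
```

Given the comments "First remark. - _Some Author_, Jul 15 2009" and "Second remark. - _Some Author_, Apr 02, 2010", a plain text search for both "Jul" and "2010" matches, while has_comment_in_month(comments, 2010, 7) is False. The first-half-of-2010 query is has_comment_between(comments, date(2010, 1, 1), date(2010, 6, 30)).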

Templating

Certain things show up frequently in the database, like links to MathWorld. In many ways it would be nice to collect these together. For example, what if one changed names, say from "World of Mathematics" to "MathWorld"? For the wiki side there is {{MathWorld}}, but nothing for the sequence side at the moment. (Actually, I'm not even sure what the recommended format for such links is now....)

Some possibilities:

  • Abramowitz & Stegun
  • MathWorld
  • Wikipedia
  • the Internet Archive
  • the EIS and HIS

See also Style Sheet#References and Talk:Style Sheet#Templates for references.

Semantics

POSH, especially rel attributes, would be good. Microformats like COinS, hCalendar, hCard would be great, though probably not a high priority. We probably already meet WCAG 2.0, but it may be worth checking. (Any accessibility experts want to chime in?)

Subject

It would be nice to have subject identifiers for sequences. One could then filter for chemistry-related sequences, or quantum physics, or number theory, or zeta functions. In general this would require constructing an appropriate ontology, but using the MSC plus ad-hoc additions for subjects outside of mathematics would probably suffice.

This could be built on the wiki side and simply linked; the category structure seems singularly appropriate, though we should probably impose a DAG requirement on the structure so that descendants and ancestors could be searched without creating loops.
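The DAG requirement is cheap to enforce at the moment a subcategory link is added: the new link parent -> child creates a loop exactly when the parent is already reachable from the child. A sketch, with the category graph held as a plain dictionary (the data structure and names are assumptions, not how the wiki stores categories):

```python
def descendants(children, node):
    """All categories reachable from `node` by following subcategory links.

    `children` maps a category name to the set of its direct subcategories.
    """
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def add_subcategory(children, parent, child):
    """Add the link parent -> child, refusing any link that would close a loop."""
    if parent == child or parent in descendants(children, child):
        raise ValueError(f"{parent!r} -> {child!r} would create a cycle")
    children.setdefault(parent, set()).add(child)
```

A category may still have several parents (say, zeta functions under both number theory and physics); only loops are rejected, so searches over descendants or ancestors always terminate.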
