Class EMBLFormat

  • All Implemented Interfaces:
    SequenceFormat, RichSequenceFormat

    public class EMBLFormat
    extends RichSequenceFormat.HeaderlessFormat
    Format reader for EMBL files. This version of the EMBL format generates and writes RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.EmblLikeFormat class.

    This format reads and writes both the Pre-87 and the 87+ versions of EMBL. By default it writes the most recent version. To write the earlier one, specify the format by passing one of the constants defined in this class (EMBL_PRE87_FORMAT or EMBL_FORMAT) to writeSequence(Sequence, String, Namespace).

    Since:
    1.5
    Author:
    Richard Holland, Jolyon Holdstock, Mark Schreiber
    • Field Detail

      • EMBL_PRE87_FORMAT

        public static final java.lang.String EMBL_PRE87_FORMAT
        The name of the Pre-87 format
        See Also:
        Constant Field Values
      • EMBL_FORMAT

        public static final java.lang.String EMBL_FORMAT
        The name of the current format
        See Also:
        Constant Field Values
      • DATABASE_XREF_TAG

        protected static final java.lang.String DATABASE_XREF_TAG
        See Also:
        Constant Field Values
      • REFERENCE_POSITION_TAG

        protected static final java.lang.String REFERENCE_POSITION_TAG
        See Also:
        Constant Field Values
      • REFERENCE_XREF_TAG

        protected static final java.lang.String REFERENCE_XREF_TAG
        See Also:
        Constant Field Values
      • FEATURE_HEADER_TAG

        protected static final java.lang.String FEATURE_HEADER_TAG
        See Also:
        Constant Field Values
      • START_SEQUENCE_TAG

        protected static final java.lang.String START_SEQUENCE_TAG
        See Also:
        Constant Field Values
      • END_SEQUENCE_TAG

        protected static final java.lang.String END_SEQUENCE_TAG
        See Also:
        Constant Field Values
      • dp

        protected static final java.util.regex.Pattern dp
      • lp

        protected static final java.util.regex.Pattern lp
      • lpPre87

        protected static final java.util.regex.Pattern lpPre87
      • vp

        protected static final java.util.regex.Pattern vp
      • rpp

        protected static final java.util.regex.Pattern rpp
      • dbxp

        protected static final java.util.regex.Pattern dbxp
      • readableFileNames

        protected static final java.util.regex.Pattern readableFileNames
      • headerLine

        protected static final java.util.regex.Pattern headerLine
    • Constructor Detail

      • EMBLFormat

        public EMBLFormat()
    • Method Detail

      • canRead

        public boolean canRead​(java.io.File file)
                        throws java.io.IOException
        Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in EMBL format if its name contains the word eem or edat, or the first line matches the EMBL format for the ID line.
        Specified by:
        canRead in interface RichSequenceFormat
        Overrides:
        canRead in class RichSequenceFormat.BasicFormat
        Parameters:
        file - the File to check.
        Returns:
        true if the file is readable by this format, false if not.
        Throws:
        java.io.IOException - in case the file is inaccessible.
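        The two-step check described above (file name first, then the first line) can be sketched with plain java.util.regex. This is only an illustration: EmblSniffer is a hypothetical helper, not part of BioJava, and both patterns are assumptions. In particular, the file-name rule used here (names ending in .em or .dat) may not be the pattern stored in readableFileNames.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of BioJava. */
final class EmblSniffer {
    // Assumed file-name rule: names ending in ".em" or ".dat", ignoring case.
    static final Pattern NAME = Pattern.compile(".*\\.(em|dat)$", Pattern.CASE_INSENSITIVE);
    // Assumed ID-line rule: "ID", whitespace, then at least one non-blank character.
    static final Pattern ID_LINE = Pattern.compile("^ID\\s+\\S.*$");

    /** Returns true if either the file name or the first line looks like EMBL. */
    static boolean looksLikeEmbl(String fileName, String firstLine) {
        if (fileName != null && NAME.matcher(fileName).matches()) {
            return true; // the name alone is enough; the content is not inspected
        }
        return firstLine != null && ID_LINE.matcher(firstLine).matches();
    }
}
```

        A caller would read the first line of the file itself and pass it in, for example EmblSniffer.looksLikeEmbl(file.getName(), firstLine).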
      • guessSymbolTokenization

        public SymbolTokenization guessSymbolTokenization​(java.io.File file)
                                                   throws java.io.IOException
        On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.
        Specified by:
        guessSymbolTokenization in interface RichSequenceFormat
        Overrides:
        guessSymbolTokenization in class RichSequenceFormat.BasicFormat
        Parameters:
        file - the File object to guess the format of.
        Returns:
        a SymbolTokenization to read the file with.
        Throws:
        java.io.IOException - if the file is unrecognisable or inaccessible.
      • canRead

        public boolean canRead​(java.io.BufferedInputStream stream)
                        throws java.io.IOException
        Check to see if a given stream is in our format. A stream is in EMBL format if its first line matches the EMBL format for the ID line.
        Parameters:
        stream - the BufferedInputStream to check.
        Returns:
        true if the stream is readable by this format, false if not.
        Throws:
        java.io.IOException - in case the stream is inaccessible.
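        Checking a stream without consuming it is usually done with mark and reset. The sketch below shows that technique under stated assumptions: StreamPeek is a hypothetical class, the 2000-byte read limit is an arbitrary choice, and the test for an ID line (a line starting with "ID" and three spaces) is a simplification rather than the pattern this class actually uses.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

/** Hypothetical sketch of a non-consuming first-line check; not BioJava code. */
final class StreamPeek {
    private static final int LIMIT = 2000; // assumed upper bound on the first line, in bytes

    /** Peeks at the first line, then rewinds the stream to where it started. */
    static boolean startsWithIdLine(BufferedInputStream stream) throws IOException {
        stream.mark(LIMIT); // remember the current position
        try {
            StringBuilder line = new StringBuilder();
            int c;
            // Read byte by byte so that no more than LIMIT bytes are ever consumed,
            // which keeps the mark valid.
            while (line.length() < LIMIT
                    && (c = stream.read()) != -1 && c != '\n' && c != '\r') {
                line.append((char) c);
            }
            return line.toString().startsWith("ID   ");
        } finally {
            stream.reset(); // leave the stream untouched for the real parser
        }
    }
}
```

        Reading through a wrapping BufferedReader would be shorter, but a wrapper may pull more bytes than the mark allows, so the sketch reads single bytes instead.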
      • guessSymbolTokenization

        public SymbolTokenization guessSymbolTokenization​(java.io.BufferedInputStream stream)
                                                   throws java.io.IOException
        On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.
        Parameters:
        stream - the BufferedInputStream object to guess the format of.
        Returns:
        a SymbolTokenization to read the stream with.
        Throws:
        java.io.IOException - if the stream is unrecognisable or inaccessible.
      • readSequence

        public boolean readSequence​(java.io.BufferedReader reader,
                                    SymbolTokenization symParser,
                                    SeqIOListener listener)
                             throws IllegalSymbolException,
                                    java.io.IOException,
                                    ParseException
        Read a sequence and pass data on to a SeqIOListener.
        Parameters:
        reader - The stream of data to parse.
        symParser - A SymbolTokenization defining a mapping from character data to Symbols.
        listener - A listener to notify when data is extracted from the stream.
        Returns:
        a boolean indicating whether or not the stream contains any more sequences.
        Throws:
        IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
        java.io.IOException - if an error occurs while reading from the stream.
        ParseException
      • readRichSequence

        public boolean readRichSequence​(java.io.BufferedReader reader,
                                        SymbolTokenization symParser,
                                        RichSeqIOListener rlistener,
                                        Namespace ns)
                                 throws IllegalSymbolException,
                                        java.io.IOException,
                                        ParseException
        Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.
        Parameters:
        reader - the input source
        symParser - the tokenizer which understands the sequence being read
        rlistener - the listener to send sequence events to
        ns - the namespace to read sequences into.
        Returns:
        true if there is more to read after this, false otherwise.
        Throws:
        IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
        java.io.IOException - if there was a read error.
        ParseException
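        The contract above (push events to a listener, return true while more records remain) can be illustrated with a small standalone reader. This is a sketch under assumptions, not the parser in this class: RecordReaderSketch is hypothetical, a BiConsumer stands in for RichSeqIOListener, each line is split into a two-character line code and the text from column six, and a line starting with "//" is taken as the end of a record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.BiConsumer;

/** Hypothetical event-style record reader; a stand-in, not BioJava's parser. */
final class RecordReaderSketch {
    /**
     * Reads one record, reporting (lineCode, value) pairs to the listener.
     * Returns true if non-blank data follows the record terminator.
     */
    static boolean readRecord(BufferedReader reader,
                              BiConsumer<String, String> listener) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("//")) {
                break; // end of this record
            }
            if (line.length() < 2) {
                continue; // skip blank or malformed lines
            }
            String code = line.substring(0, 2);
            String value = line.length() > 5 ? line.substring(5).trim() : "";
            listener.accept(code, value);
        }
        // Look ahead one character at a time to see whether another record follows.
        while (true) {
            reader.mark(1);
            int c = reader.read();
            if (c == -1) {
                return false; // nothing left to read
            }
            if (!Character.isWhitespace(c)) {
                reader.reset(); // put the character back for the next call
                return true;
            }
        }
    }
}
```

        A caller would typically loop on the return value: do { more = RecordReaderSketch.readRecord(reader, listener); } while (more);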
      • writeSequence

        public void writeSequence​(Sequence seq,
                                  java.io.PrintStream os)
                           throws java.io.IOException
        writeSequence writes a sequence to the specified PrintStream, using the default format.
        Parameters:
        seq - the sequence to write out.
        os - the PrintStream to write to.
        Throws:
        java.io.IOException
      • writeSequence

        public void writeSequence​(Sequence seq,
                                  java.lang.String format,
                                  java.io.PrintStream os)
                           throws java.io.IOException
        writeSequence writes a sequence to the specified PrintStream, using the specified format.
        Parameters:
        seq - a Sequence to write out.
        format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
        os - a PrintStream object.
        Throws:
        java.io.IOException - if an error occurs.
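        Selecting a sub-format by name amounts to a dispatch on the format string. The sketch below illustrates this for the ID line only. Everything in it is an assumption made for illustration: IdLineSketch and its two format names are hypothetical (they are not the values of EMBL_FORMAT or EMBL_PRE87_FORMAT), the ID line layouts are simplified approximations of the 87+ and Pre-87 styles, and the topology and data class are hard-coded for brevity.

```java
/** Hypothetical sketch of sub-format dispatch; names and layouts are assumptions. */
final class IdLineSketch {
    static final String CURRENT = "CURRENT"; // hypothetical sub-format name
    static final String PRE87 = "PRE87";     // hypothetical sub-format name

    /** Builds an ID line in the requested layout, or rejects an unknown format. */
    static String idLine(String format, String accession, int version,
                         String molType, String division, int length) {
        if (CURRENT.equals(format)) {
            // Assumed 87+ style: accession and sequence version are separate fields.
            return "ID   " + accession + "; SV " + version + "; linear; " + molType
                    + "; STD; " + division + "; " + length + " BP.";
        }
        if (PRE87.equals(format)) {
            // Assumed Pre-87 style: no sequence version on the ID line.
            return "ID   " + accession + " standard; " + molType + "; "
                    + division + "; " + length + " BP.";
        }
        throw new IllegalArgumentException("Unknown sub-format: " + format);
    }
}
```

        Writing to a PrintStream is then a single call, for example os.println(IdLineSketch.idLine(IdLineSketch.PRE87, "X56734", 1, "mRNA", "PLN", 1859)).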
      • writeSequence

        public void writeSequence​(Sequence seq,
                                  Namespace ns)
                           throws java.io.IOException
        Writes a sequence to the output stream given by beginWriting(), using the default format of the implementing class. If a namespace is given, sequences are written with that namespace; otherwise they are written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). That conversion is not guaranteed to be perfect, so it is better to use RichSequences from the start. This implementation ignores the namespace, as EMBL has no concept of it.
        Parameters:
        seq - the sequence to write
        ns - the namespace to write it with
        Throws:
        java.io.IOException - in case it couldn't write something
      • getDefaultFormat

        public java.lang.String getDefaultFormat()
        getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
        Returns:
        a String.