Getting Started

Introduction

This guide walks you through reading a text file that contains metadata, a header row and mixed data types with Tabbed.

Imports
import os
import tempfile
import random
from datetime import datetime, timedelta
from itertools import chain

from tabbed import samples
from tabbed.reading import Reader

Sample File

Tabbed comes preloaded with a sample text file called annotations.txt. Below we open this file to see what it looks like and develop a list of operations we would like Tabbed to handle automatically for us.

Preview Sample Data
fp = samples.paths.annotations
with open(fp, 'r') as infile:
    for line in infile:
        print(line, end='')
View Sample Data

Experiment ID Experiment
Animal ID Animal
Researcher Test
Directory path 

Number Start Time End Time Time From Start Channel Annotation
0 02/09/22 09:17:38.948 02/09/22 09:17:38.948 0.0000 ALL Started Recording
1 02/09/22 09:37:00.000 02/09/22 09:37:00.000 1161.0520 ALL start
2 02/09/22 09:37:00.000 02/09/22 09:37:08.784 1161.0520 ALL exploring
3 02/09/22 09:37:08.784 02/09/22 09:37:13.897 1169.8360 ALL grooming
4 02/09/22 09:37:13.897 02/09/22 09:38:01.262 1174.9490 ALL exploring
5 02/09/22 09:38:01.262 02/09/22 09:38:07.909 1222.3140 ALL grooming
6 02/09/22 09:38:07.909 02/09/22 09:38:20.258 1228.9610 ALL exploring
7 02/09/22 09:38:20.258 02/09/22 09:38:25.435 1241.3100 ALL grooming
8 02/09/22 09:38:25.435 02/09/22 09:40:07.055 1246.4870 ALL exploring
9 02/09/22 09:40:07.055 02/09/22 09:40:22.334 1348.1070 ALL grooming
10 02/09/22 09:40:22.334 02/09/22 09:41:36.664 1363.3860 ALL exploring
11 02/09/22 09:41:36.664 02/09/22 09:41:46.326 1437.7160 ALL grooming
12 02/09/22 09:41:46.326 02/09/22 09:44:16.857 1447.3780 ALL exploring
13 02/09/22 09:44:16.857 02/09/22 09:44:58.225 1597.9090 ALL grooming
14 02/09/22 09:44:58.225 02/09/22 09:45:35.800 1639.2770 ALL exploring
15 02/09/22 09:45:35.800 02/09/22 09:45:40.506 1676.8520 ALL grooming
16 02/09/22 09:45:40.506 02/09/22 09:47:03.165 1681.5580 ALL exploring
17 02/09/22 09:47:03.165 02/09/22 09:47:16.448 1764.2170 ALL grooming
18 02/09/22 09:47:16.448 02/09/22 09:47:55.227 1777.5000 ALL exploring
19 02/09/22 09:47:55.227 02/09/22 09:48:05.044 1816.2790 ALL grooming
20 02/09/22 09:48:05.044 02/09/22 09:51:40.919 1826.0960 ALL exploring
21 02/09/22 09:51:40.919 02/09/22 09:51:47.331 2041.9710 ALL grooming
22 02/09/22 09:51:47.331 02/09/22 09:52:20.626 2048.3830 ALL exploring
23 02/09/22 09:52:20.626 02/09/22 09:52:29.406 2081.6780 ALL grooming
24 02/09/22 09:52:29.406 02/09/22 09:53:07.268 2090.4580 ALL exploring
25 02/09/22 09:53:07.268 02/09/22 09:53:21.147 2128.3200 ALL grooming
26 02/09/22 09:53:21.147 02/09/22 09:54:19.752 2142.1990 ALL exploring
27 02/09/22 09:54:19.752 02/09/22 09:54:38.782 2200.8040 ALL grooming
28 02/09/22 09:54:38.782 02/09/22 09:56:30.491 2219.8340 ALL exploring
29 02/09/22 09:56:30.491 02/09/22 09:56:40.306 2331.5430 ALL grooming
30 02/09/22 09:56:40.306 02/09/22 09:57:11.920 2341.3580 ALL exploring
31 02/09/22 09:57:11.920 02/09/22 09:57:18.783 2372.9720 ALL grooming
32 02/09/22 09:57:18.783 02/09/22 10:00:02.036 2379.8350 ALL exploring
33 02/09/22 10:00:02.036 02/09/22 10:00:08.325 2543.0880 ALL resting
34 02/09/22 10:00:08.325 02/09/22 10:01:57.278 2549.3770 ALL exploring
35 02/09/22 10:01:57.278 02/09/22 10:02:17.993 2658.3300 ALL grooming
36 02/09/22 10:02:17.993 02/09/22 10:03:04.118 2679.0450 ALL exploring
37 02/09/22 10:03:04.118 02/09/22 10:03:04.118 2725.1700 ALL stop
38 02/09/22 10:17:30.082 02/09/22 10:17:30.082 3591.1340 ALL Stopped Recording

Tabbed Wish List

To read files like this, we desire Tabbed to support the following:

Header Detection

This sample file contains a metadata section before the header, which sits on line 7 (line 6 when counting from zero, as Tabbed's output does). Metadata can be unstructured, like a paragraph, or structured into columns separated by a delimiter. We want Tabbed to automatically detect the metadata section and header line of any file.

Type Inference

The string cells in the sample file encode 4 different data types: integers, datetimes, floats and strings. We want Tabbed to perform type inference.
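As a rough illustration of what type inference involves, the sketch below tries a sequence of conversions on each string cell and keeps the first one that succeeds. This is not Tabbed's implementation: the `infer` helper and the single hard-coded datetime format are assumptions chosen to match the sample file.

```python
from datetime import datetime

# Assumed datetime layout of the sample file, e.g. '02/09/22 09:17:38.948'
FMT = '%m/%d/%y %H:%M:%S.%f'


def infer(cell):
    """Convert a string cell to int, float or datetime, else keep the string."""
    for convert in (int, float, lambda s: datetime.strptime(s, FMT)):
        try:
            return convert(cell)
        except ValueError:
            continue
    return cell


# one data row from the sample file, split on tabs
row = ['3', '02/09/22 09:37:08.784', '02/09/22 09:37:13.897',
       '1169.8360', 'ALL', 'grooming']
typed = [infer(cell) for cell in row]
print([type(value).__name__ for value in typed])
```

The order of the conversions matters: `int` is tried before `float` so that whole numbers are not silently widened to floats.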

Data Filtering

We want Tabbed to support simple value-based row and column filtering. For example, in this file we might want only the rows where the Start Time column is earlier than datetime(2022, 2, 9, 9, 37, 13), where the Annotation column has the string value 'exploring', or where both conditions hold.
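To make the wish concrete, here is the same filtering written by hand in plain Python on a few already-typed rows copied from the sample file. It only illustrates the behavior we are asking for and says nothing about how Tabbed implements it; the variable names are illustrative.

```python
from datetime import datetime

# a few typed rows, hand-copied from the sample file
rows = [
    {'Number': 1, 'Start_Time': datetime(2022, 2, 9, 9, 37, 0), 'Annotation': 'start'},
    {'Number': 2, 'Start_Time': datetime(2022, 2, 9, 9, 37, 0), 'Annotation': 'exploring'},
    {'Number': 3, 'Start_Time': datetime(2022, 2, 9, 9, 37, 8, 784000), 'Annotation': 'grooming'},
    {'Number': 4, 'Start_Time': datetime(2022, 2, 9, 9, 37, 13, 897000), 'Annotation': 'exploring'},
]
cutoff = datetime(2022, 2, 9, 9, 37, 13)

# row filtering on each condition, and on both together
early = [r for r in rows if r['Start_Time'] < cutoff]
exploring = [r for r in rows if r['Annotation'] == 'exploring']
both = [r for r in rows if r['Start_Time'] < cutoff and r['Annotation'] == 'exploring']

# column filtering: keep only the requested keys of each row
columns = ['Number', 'Annotation']
trimmed = [{name: r[name] for name in columns} for r in both]
print(trimmed)
```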

Partial & Iterative Reading

Text files can be large. Tabbed should support partial and iterative reading.

Flexibility

Tabbed should be flexible. It should be able to start reading at any file position, skip reading of 'bad' rows, and allow users to choose how much memory to consume during iterative reading of large files.

The Tabbed Reader

Tabbed's Reader reads rows of an infile to dictionaries just like Python's built-in csv.DictReader. However, Tabbed's Reader embeds a sophisticated file Sniffer that can detect metadata, header & data sections of a file automatically (for details see Sniffer). The detected metadata, header and datatypes are available to the reader as properties. In this section, we will build a reader and see how to access the file's dialect, metadata, header, and inferred datatypes.

Building a Reader
fp = samples.paths.annotations
infile = open(fp, 'r')
# like Python's csv.DictReader, we pass an open file instance
reader = Reader(infile)
Accessing Dialect
fp = samples.paths.annotations
infile = open(fp, 'r')
# like Python's csv.DictReader, we pass an open file instance
reader = Reader(infile)
print(reader.sniffer.dialect)

Dialect

SimpleDialect('\t', '"', None)

The output dialect is a SimpleDialect instance of the clevercsv package.
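The three values in the printed SimpleDialect are its delimiter, quote character and escape character. As a rough illustration of what those settings mean, the sketch below applies the same three values with Python's standard-library csv module to two lines copied from the sample file (tabs restored). It uses csv only for illustration; it is not how Tabbed parses rows.

```python
import csv
import io

# header fragment and first data row of the sample, tab-delimited
sample = (
    'Number\tStart Time\tEnd Time\n'
    '0\t02/09/22 09:17:38.948\t02/09/22 09:17:38.948\n'
)

# delimiter, quotechar and escapechar mirror SimpleDialect('\t', '"', None)
parsed = list(
    csv.reader(io.StringIO(sample), delimiter='\t', quotechar='"', escapechar=None)
)
for cells in parsed:
    print(cells)
```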

Metadata & Header Detection
# the reader's header and metadata properties call the sniffer
print(reader.header)
print('---')
print(reader.metadata())

Metadata and Header Detection

Header(line=6, names=['Number', 'Start_Time', 'End_Time', 'Time_From_Start', 'Channel', 'Annotation'], string='Number\tStart Time\tEnd Time\tTime From Start\tChannel\tAnnotation')
---
MetaData(lines=(0, 6), string='Experiment ID\tExperiment\nAnimal ID\tAnimal\nResearcher\tTest\nDirectory path\t\n\n')

The header was detected on line 6 (counting from zero) and has 6 column names. The metadata string spans from line 0 up to line 6. The embedded Sniffer instance samples the file when the reader is created.
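Notice that the detected names differ from the raw header string: spaces have become underscores, which is what allows a column such as Start_Time to be written as a Python keyword argument when tabbing. A minimal sketch of that mapping, assuming plain space replacement is all that happens (Tabbed's actual normalization may do more):

```python
# the raw header string reported by the reader
header_string = 'Number\tStart Time\tEnd Time\tTime From Start\tChannel\tAnnotation'

# split on the detected delimiter and make each name identifier-friendly
names = [name.replace(' ', '_') for name in header_string.split('\t')]
print(names)
```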

Type Inference
# request the sniffed types by polling the last 10 rows of the sniffed sample
# consistent is a `bool` indicating if types are consistent across sample rows
types, consistent = reader.sniffer.types(poll=10)
print(types)

Type Inference

[<class 'int'>, <class 'datetime.datetime'>, <class 'datetime.datetime'>, <class 'float'>, <class 'str'>, <class 'str'>]

Our extensive testing on randomly generated text files indicates that Tabbed's Reader will detect dialect, metadata, header, and types correctly in most cases. Should you encounter a problem, you can change the sample the Sniffer uses to measure these properties. The Sniffer's start, amount, and skips alter the sniffing sample. You can also change which sample rows are used for type polling via the poll and exclude arguments of the Reader initializer. All these arguments can help in the auto-detection of the header and metadata sections of a text file. For help understanding these parameters, type help(reader.sniffer) or see Sniffer. Below, we show the sniffer and its default parameters used in this example.

Default Sniffer
# print the current sniffer used by the reader
print(reader.sniffer)
Sniffer
--- Attributes ---
infile: <_io.TextIOWr...oding='UTF-8'>
decimal: '.'
--- Properties ---
amount: 100
dialect: SimpleDialect('\t', '"', None)
lines: [0, 1, 2, 3, 4, 5, ...]
rows: [['Experiment ID', 'Experiment'], ['Animal ID', 'Animal'], ['Researcher', 'Test'], ['Directory path'], [''], [''], ...]
sample: 'Experiment I...d Recording\n'
skips: []
start: 0
--- Methods ---
datetime_formats
header
metadata
sniff
types
Type help(Sniffer) for full documentation
Default Reader
# print the default poll and exclude arguments
print(reader.poll, reader.exclude)
20 ['', ' ', '-', 'nan', 'NaN', 'NAN']
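To see why poll and exclude matter for type inference, consider the sketch below. It votes on a column's type using the last `poll` cells and ignores cells that look like missing values. The helper names and the voting rule are assumptions made for illustration; Tabbed's own polling logic may differ.

```python
from collections import Counter

# the default exclude values printed above
EXCLUDE = ('', ' ', '-', 'nan', 'NaN', 'NAN')


def cell_type(cell):
    """Return int, float or str depending on which conversion succeeds."""
    for convert in (int, float):
        try:
            convert(cell)
            return convert
        except ValueError:
            pass
    return str


def poll_column(cells, poll=20, exclude=EXCLUDE):
    """Vote on a column's type using its last `poll` non-excluded cells."""
    polled = [cell for cell in cells[-poll:] if cell not in exclude]
    counts = Counter(cell_type(cell) for cell in polled)
    if not counts:
        # nothing left to poll, fall back to strings
        return str, True
    return counts.most_common(1)[0][0], len(counts) == 1


column = ['1161.0520', '-', '1169.8360', 'nan', '1174.9490']
print(poll_column(column))              # missing values ignored
print(poll_column(column, exclude=()))  # '-' is polled as a string
```

With the exclusions in place every polled cell is a float, so the types are consistent. Without them the '-' placeholder is counted as a string and the column is reported as inconsistent.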

Data Filtering

Tabbed provides a powerful mechanism for value-based filtering of rows and columns. These filters are called Tabs in Tabbed and support equality, membership, rich comparison, regular expression, and custom filtering of data. The reader.tab method provides a simple way to construct Tabs with keyword arguments.

Equality Tabbing

Equality Tabbing Example
reader.tab(Annotation='exploring', columns=['Number', 'Annotation'])
for row in chain.from_iterable(reader.read()):
    print(row)
{'Number': 2, 'Annotation': 'exploring'}
{'Number': 4, 'Annotation': 'exploring'}
{'Number': 6, 'Annotation': 'exploring'}
{'Number': 8, 'Annotation': 'exploring'}
{'Number': 10, 'Annotation': 'exploring'}
{'Number': 12, 'Annotation': 'exploring'}
{'Number': 14, 'Annotation': 'exploring'}
{'Number': 16, 'Annotation': 'exploring'}
{'Number': 18, 'Annotation': 'exploring'}
{'Number': 20, 'Annotation': 'exploring'}
{'Number': 22, 'Annotation': 'exploring'}
{'Number': 24, 'Annotation': 'exploring'}
{'Number': 26, 'Annotation': 'exploring'}
{'Number': 28, 'Annotation': 'exploring'}
{'Number': 30, 'Annotation': 'exploring'}
{'Number': 32, 'Annotation': 'exploring'}
{'Number': 34, 'Annotation': 'exploring'}
{'Number': 36, 'Annotation': 'exploring'}

For now, ignore chain.from_iterable(reader.read()) and focus on the first line, the reader.tab call, where we tab the rows whose Annotation value equals 'exploring' and ask the reader to read only the Number and Annotation columns. Notice that the output row dictionaries contain only the rows and columns that match this Tabbing. For more details on equality tabbing, see the Equality Tab.

Membership Tabbing

Membership Tabbing Example
reader.tab(Annotation=['exploring', 'resting'], columns=[0, 5])
for row in chain.from_iterable(reader.read()):
    print(row)
{'Number': 2, 'Annotation': 'exploring'}
{'Number': 4, 'Annotation': 'exploring'}
{'Number': 6, 'Annotation': 'exploring'}
{'Number': 8, 'Annotation': 'exploring'}
{'Number': 10, 'Annotation': 'exploring'}
{'Number': 12, 'Annotation': 'exploring'}
{'Number': 14, 'Annotation': 'exploring'}
{'Number': 16, 'Annotation': 'exploring'}
{'Number': 18, 'Annotation': 'exploring'}
{'Number': 20, 'Annotation': 'exploring'}
{'Number': 22, 'Annotation': 'exploring'}
{'Number': 24, 'Annotation': 'exploring'}
{'Number': 26, 'Annotation': 'exploring'}
{'Number': 28, 'Annotation': 'exploring'}
{'Number': 30, 'Annotation': 'exploring'}
{'Number': 32, 'Annotation': 'exploring'}
{'Number': 33, 'Annotation': 'resting'}
{'Number': 34, 'Annotation': 'exploring'}
{'Number': 36, 'Annotation': 'exploring'}

Focus on the first line, the reader.tab call, where we tab the rows whose Annotation value is in ['exploring', 'resting'] and ask the reader to read only the Number and Annotation columns, this time selected by column index. Notice that the output row dictionaries contain only the rows and columns that match this Tabbing. For more details on membership tabbing, see the Membership Tab.

Comparison Tabbing

Rich Comparison Tabbing Example
# get all the annotations between 9:38:00 and 9:42:00
reader.tab(Start_Time='> 9/2/2022 9:38:00 and < 9/2/2022 9:42:00', columns=[0, 1])
for row in chain.from_iterable(reader.read()):
    print(row)
{'Number': 5, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 1, 262000)}
{'Number': 6, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 7, 909000)}
{'Number': 7, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 20, 258000)}
{'Number': 8, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 38, 25, 435000)}
{'Number': 9, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 40, 7, 55000)}
{'Number': 10, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 40, 22, 334000)}
{'Number': 11, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 41, 36, 664000)}
{'Number': 12, 'Start_Time': datetime.datetime(2022, 2, 9, 9, 41, 46, 326000)}

Again, focus on the reader.tab call (the second line), where we tab the rows whose Start_Time value is between 9:38:00 and 9:42:00 and ask the reader to read only the Number and Start_Time columns using column indexing. Notice that the output row dictionaries contain only the rows and columns that match this Tabbing. For more details on comparison tabbing, see the Comparison Tab.

Regular Expression Tabbing

Regular Expression Tabbing Example
import re
# get all the annotations that start with 'g' or 'r'
reader.tab(Annotation=re.compile(r'^[gr]'), columns=[0, 5])
for row in chain.from_iterable(reader.read()):
    print(row)
{'Number': 3, 'Annotation': 'grooming'}
{'Number': 5, 'Annotation': 'grooming'}
{'Number': 7, 'Annotation': 'grooming'}
{'Number': 9, 'Annotation': 'grooming'}
{'Number': 11, 'Annotation': 'grooming'}
{'Number': 13, 'Annotation': 'grooming'}
{'Number': 15, 'Annotation': 'grooming'}
{'Number': 17, 'Annotation': 'grooming'}
{'Number': 19, 'Annotation': 'grooming'}
{'Number': 21, 'Annotation': 'grooming'}
{'Number': 23, 'Annotation': 'grooming'}
{'Number': 25, 'Annotation': 'grooming'}
{'Number': 27, 'Annotation': 'grooming'}
{'Number': 29, 'Annotation': 'grooming'}
{'Number': 31, 'Annotation': 'grooming'}
{'Number': 33, 'Annotation': 'resting'}
{'Number': 35, 'Annotation': 'grooming'}

Focus on the reader.tab call (the third line), where we tab the rows whose Annotation value starts with 'g' or 'r' and ask the reader to read only the Number and Annotation columns using column indexing. Notice that the output row dictionaries contain only the rows and columns that match this Tabbing. For more details on regex tabbing, see the Regex Tab.
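A note on the pattern itself: inside a character class no alternation bar is needed, so `[gr]` already means "g or r", while `[g|r]` would additionally match a literal `|`. The short check below uses only Python's re module on the annotation values that occur in the sample file; it does not involve Tabbed.

```python
import re

# every distinct annotation value in the sample file
annotations = ['Started Recording', 'start', 'exploring', 'grooming',
               'resting', 'stop', 'Stopped Recording']

pattern = re.compile(r'^[gr]')
matches = [value for value in annotations if pattern.search(value)]
print(matches)

# the bar inside a character class is a literal character
print(bool(re.search(r'^[g|r]', '|pipe')), bool(pattern.search('|pipe')))
```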

Custom Tabbing

Tabbed also supports construction of Calling Tabs that allow you to provide your own custom logic for row filtering. For details see the Calling Tab in the reference manual.
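As an example of logic that the built-in Tabs cannot express, the sketch below keeps only the rows whose annotation lasted longer than 30 seconds, which requires comparing two columns of the same row. The function name, its signature and the threshold are hypothetical, and the filter is applied by hand to typed rows; how a callable is handed to Tabbed is described in the reference manual.

```python
from datetime import datetime, timedelta


def lasts_longer_than(row, seconds=30):
    """Return True if the row's End_Time is more than `seconds` after its Start_Time."""
    return row['End_Time'] - row['Start_Time'] > timedelta(seconds=seconds)


# typed rows 3 to 5 of the sample file
rows = [
    {'Number': 3, 'Start_Time': datetime(2022, 2, 9, 9, 37, 8, 784000),
     'End_Time': datetime(2022, 2, 9, 9, 37, 13, 897000)},
    {'Number': 4, 'Start_Time': datetime(2022, 2, 9, 9, 37, 13, 897000),
     'End_Time': datetime(2022, 2, 9, 9, 38, 1, 262000)},
    {'Number': 5, 'Start_Time': datetime(2022, 2, 9, 9, 38, 1, 262000),
     'End_Time': datetime(2022, 2, 9, 9, 38, 7, 909000)},
]

kept = [row['Number'] for row in rows if lasts_longer_than(row)]
print(kept)
```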

Reading

The Reader.read method returns an iterator of lists. Each yielded list contains row dictionaries from the data section. The values in each dictionary have been type-cast, and the rows have been filtered by any Tabs. The chunksize parameter of the read method determines how many row dictionaries to yield per iteration. Let's take a look at the read method with our sample file.

Return Type
# for ease of reading just get the Number & Annotation columns
reader.tab(columns=['Number', 'Annotation'])
# calling read creates an iterator
gen = reader.read(chunksize=5)
print(type(gen))

Return Type

<class 'generator'>

Chunksize
for idx, chunk in enumerate(reader.read(chunksize=2)):
    print(f'chunk {idx}: {chunk}')
chunksize

chunk 0: [{'Number': 0, 'Annotation': 'Started Recording'}, {'Number': 1, 'Annotation': 'start'}]
chunk 1: [{'Number': 2, 'Annotation': 'exploring'}, {'Number': 3, 'Annotation': 'grooming'}]
chunk 2: [{'Number': 4, 'Annotation': 'exploring'}, {'Number': 5, 'Annotation': 'grooming'}]
chunk 3: [{'Number': 6, 'Annotation': 'exploring'}, {'Number': 7, 'Annotation': 'grooming'}]
chunk 4: [{'Number': 8, 'Annotation': 'exploring'}, {'Number': 9, 'Annotation': 'grooming'}]
chunk 5: [{'Number': 10, 'Annotation': 'exploring'}, {'Number': 11, 'Annotation': 'grooming'}]
chunk 6: [{'Number': 12, 'Annotation': 'exploring'}, {'Number': 13, 'Annotation': 'grooming'}]
chunk 7: [{'Number': 14, 'Annotation': 'exploring'}, {'Number': 15, 'Annotation': 'grooming'}]
chunk 8: [{'Number': 16, 'Annotation': 'exploring'}, {'Number': 17, 'Annotation': 'grooming'}]
chunk 9: [{'Number': 18, 'Annotation': 'exploring'}, {'Number': 19, 'Annotation': 'grooming'}]
chunk 10: [{'Number': 20, 'Annotation': 'exploring'}, {'Number': 21, 'Annotation': 'grooming'}]
chunk 11: [{'Number': 22, 'Annotation': 'exploring'}, {'Number': 23, 'Annotation': 'grooming'}]
chunk 12: [{'Number': 24, 'Annotation': 'exploring'}, {'Number': 25, 'Annotation': 'grooming'}]
chunk 13: [{'Number': 26, 'Annotation': 'exploring'}, {'Number': 27, 'Annotation': 'grooming'}]
chunk 14: [{'Number': 28, 'Annotation': 'exploring'}, {'Number': 29, 'Annotation': 'grooming'}]
chunk 15: [{'Number': 30, 'Annotation': 'exploring'}, {'Number': 31, 'Annotation': 'grooming'}]
chunk 16: [{'Number': 32, 'Annotation': 'exploring'}, {'Number': 33, 'Annotation': 'resting'}]
chunk 17: [{'Number': 34, 'Annotation': 'exploring'}, {'Number': 35, 'Annotation': 'grooming'}]
chunk 18: [{'Number': 36, 'Annotation': 'exploring'}, {'Number': 37, 'Annotation': 'stop'}]
chunk 19: [{'Number': 38, 'Annotation': 'Stopped Recording'}]

Each yield of the read iterator gave us 2 rows from the data section, except the last, which holds the single remaining row. You can set chunksize to any int value. The default is 200,000 rows per yield. The read method has several parameters for controlling which rows are yielded, including start, skips and indices. Details on these parameters can be found using help(Reader.read) or in read's documentation.
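The chunking pattern itself is easy to picture with a small generator. The sketch below is not Tabbed's code; it uses itertools.islice to group any iterable into lists of at most `chunksize` items, and the `chunked` name is an assumption.

```python
from itertools import islice


def chunked(rows, chunksize):
    """Yield successive lists of at most `chunksize` items from rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# 39 stand-in rows, one per data row of the sample file
chunks = list(chunked(range(39), chunksize=2))
print(len(chunks), chunks[0], chunks[-1])
```

Because only one chunk is held in memory at a time, the chunk size directly bounds memory use when iterating over a large file.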

The read method always returns an iterator, but for small files you may want to read the whole file into memory. This is simple using Python's itertools module. Below is a recipe for converting read's iterator to an in-memory list.

As in-memory list
from itertools import chain
data = list(chain.from_iterable(reader.read(chunksize=2)))
print(*data, sep='\n')
Reading to an in-memory list

{'Number': 0, 'Annotation': 'Started Recording'}
{'Number': 1, 'Annotation': 'start'}
{'Number': 2, 'Annotation': 'exploring'}
{'Number': 3, 'Annotation': 'grooming'}
{'Number': 4, 'Annotation': 'exploring'}
{'Number': 5, 'Annotation': 'grooming'}
{'Number': 6, 'Annotation': 'exploring'}
{'Number': 7, 'Annotation': 'grooming'}
{'Number': 8, 'Annotation': 'exploring'}
{'Number': 9, 'Annotation': 'grooming'}
{'Number': 10, 'Annotation': 'exploring'}
{'Number': 11, 'Annotation': 'grooming'}
{'Number': 12, 'Annotation': 'exploring'}
{'Number': 13, 'Annotation': 'grooming'}
{'Number': 14, 'Annotation': 'exploring'}
{'Number': 15, 'Annotation': 'grooming'}
{'Number': 16, 'Annotation': 'exploring'}
{'Number': 17, 'Annotation': 'grooming'}
{'Number': 18, 'Annotation': 'exploring'}
{'Number': 19, 'Annotation': 'grooming'}
{'Number': 20, 'Annotation': 'exploring'}
{'Number': 21, 'Annotation': 'grooming'}
{'Number': 22, 'Annotation': 'exploring'}
{'Number': 23, 'Annotation': 'grooming'}
{'Number': 24, 'Annotation': 'exploring'}
{'Number': 25, 'Annotation': 'grooming'}
{'Number': 26, 'Annotation': 'exploring'}
{'Number': 27, 'Annotation': 'grooming'}
{'Number': 28, 'Annotation': 'exploring'}
{'Number': 29, 'Annotation': 'grooming'}
{'Number': 30, 'Annotation': 'exploring'}
{'Number': 31, 'Annotation': 'grooming'}
{'Number': 32, 'Annotation': 'exploring'}
{'Number': 33, 'Annotation': 'resting'}
{'Number': 34, 'Annotation': 'exploring'}
{'Number': 35, 'Annotation': 'grooming'}
{'Number': 36, 'Annotation': 'exploring'}
{'Number': 37, 'Annotation': 'stop'}
{'Number': 38, 'Annotation': 'Stopped Recording'}

When Something Goes Wrong

In most cases, we think Tabbed will work out-of-the-box on your text files but the variability in dialects and structures means we can't guarantee it. Tabbed provides several fallbacks to help you read files when something has gone wrong. Specifically there are two problems you may encounter:

Incorrect Start Row

If Tabbed fails to detect the file's structure, the start row for the read will be incorrect. You have three options to deal with this.

  • Adjust the start, amount, or skips attributes of the sniffer, or the exclude parameter of the sniffer's header and metadata methods. These control the sample the sniffer uses to detect the header and metadata if they exist. You can use Reader.peek to help you determine good values for these parameters.
  • Adjust the default poll and exclude arguments of a Reader instance. In particular, the exclude argument can be used to ignore missing values for better type inference.
  • During reading, set the start parameter to force reading to begin at a specific row. This also requires you to set the reader's header manually by assigning a list of header names to reader.header. This method should always work when structure (metadata, header, etc.) isn't being detected.
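The idea behind the last option, a forced start row plus a hand-written header, can be pictured with the standard library alone. The sketch below is an analogy, not Tabbed's API: it skips to a chosen line of an in-memory file and supplies the column names itself. The shortened file contents and the `start` value are made up for the example.

```python
import csv
import io
from itertools import islice

# a shortened stand-in for a file whose structure was not detected
text = (
    'Experiment ID\tExperiment\n'
    'Animal ID\tAnimal\n'
    '\n'
    'Number\tStart Time\tAnnotation\n'
    '0\t02/09/22 09:17:38.948\tStarted Recording\n'
    '1\t02/09/22 09:37:00.000\tstart\n'
)

start = 4  # zero-based index of the first data line, found by inspection
names = ['Number', 'Start_Time', 'Annotation']  # header supplied by hand

infile = io.StringIO(text)
data_lines = islice(infile, start, None)  # skip metadata and header lines
rows = list(csv.DictReader(data_lines, fieldnames=names, delimiter='\t'))
for row in rows:
    print(row)
```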

Wonky Data Values

Tabbed supports reading ints, floats, complex, time, date and datetime types. It further assumes that these types are consistent across rows within a column in the data section. If Tabbed encounters a type conversion error, it gracefully returns the value as a string type and logs the error to the Reader.errors attribute. You can use this log to figure out what rows had problems and skip them or change the values using your own callable after they have been read by Tabbed.
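The fallback described above, keeping the string and recording the failure, follows a common pattern that the sketch below reproduces in plain Python. The `cast_column` helper and the layout of each logged entry are assumptions for illustration; the actual format of Reader.errors is not shown here.

```python
def cast_column(cells, cast):
    """Cast each cell, keeping failures as strings and logging them."""
    values, errors = [], []
    for index, cell in enumerate(cells):
        try:
            values.append(cast(cell))
        except ValueError as err:
            values.append(cell)  # fall back to the original string
            errors.append((index, cell, str(err)))
    return values, errors


values, errors = cast_column(['0.0000', '1161.0520', 'n/a', '1174.9490'], float)
print(values)

# the log tells us which rows to skip or repair afterwards
bad_rows = [index for index, _, _ in errors]
print(bad_rows)
```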