Reducers

Question Reducer

This module provides functions to reduce the question task extracts from panoptes_aggregation.extractors.question_extractor.

panoptes_aggregation.reducers.question_reducer.question_reducer(data_list, pairs=False, track_user_ids=False, **kwargs)

Reduce a list of extracted questions into a “counter” dict

Parameters:
  • data_list (list) – A list of extractions created by panoptes_aggregation.extractors.question_extractor.question_extractor()

  • pairs (bool, optional) – Default False. How multiple choice questions are treated. When True the set of all choices is treated as a single answer

  • track_user_ids (bool, optional) – Default False. Set to True to also track the user_ids that gave each answer.

Returns:

reduction – A dictionary (formatted as a Counter) giving the vote count for each key. If track_user_ids is True it will also contain a list of user_ids for each answer given.

Return type:

dict
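
The counting behaviour described above can be sketched with the standard library. This is a minimal illustration and not the library's code: the helper name count_votes, the extract layout (one dict of selected choices per classification), and the "+" join used for pairs=True are all assumptions.

```python
from collections import Counter


def count_votes(extracts, pairs=False):
    """Tally answers from a list of extract dicts (hypothetical helper)."""
    counter = Counter()
    for extract in extracts:
        if pairs:
            # Treat the whole set of selected choices as one combined answer.
            # Joining with "+" is an assumption made for this sketch.
            counter["+".join(sorted(extract))] += 1
        else:
            # Count every selected choice on its own.
            counter.update(extract)
    return dict(counter)


extracts = [{"a": 1, "b": 1}, {"a": 1}, {"b": 1}]
single = count_votes(extracts)
paired = count_votes(extracts, pairs=True)
```

With pairs=False the first classification contributes one vote to each of its two choices; with pairs=True it contributes a single vote to the combined answer.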


Question Consensus Reducer

This module provides functions to reduce the question task extracts from panoptes_aggregation.extractors.question_extractor.

panoptes_aggregation.reducers.question_consensus_reducer.question_consensus_reducer(data_list, pairs=False, **kwargs)

Reduce a list of extracted questions into a consensus description dict

Parameters:
Returns:

reduction – A dictionary with the following keys

  • most_likely : key with greatest number of classifications/votes

  • num_votes : vote count for the most likely key

  • agreement : fraction of total votes held by most likely key.

Return type:

dict
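
The three returned values follow directly from a vote counter. The sketch below assumes the votes have already been tallied into a Counter; the helper name consensus_from_counts and the handling of an empty tally are hypothetical.

```python
from collections import Counter


def consensus_from_counts(counts):
    """Derive most_likely, num_votes and agreement from a vote Counter."""
    total = sum(counts.values())
    if total == 0:
        # No votes at all: nothing can be called most likely (assumed behaviour).
        return {"num_votes": 0, "agreement": 0.0}
    most_likely, num_votes = counts.most_common(1)[0]
    return {
        "most_likely": most_likely,
        "num_votes": num_votes,
        "agreement": num_votes / total,
    }


result = consensus_from_counts(Counter({"yes": 3, "no": 1}))
```

Here three of four votes go to "yes", so agreement is 0.75.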


Slider Reducer

This module provides functions to reduce the slider task extracts from panoptes_aggregation.extractors.slider_extractor.

panoptes_aggregation.reducers.slider_reducer.process_data(data, pairs=False)

Process a list of slider extractions into a list of values

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.slider_extractor.slider_extractor()

Returns:

processed_data – A list of slider values, one for each extraction

Return type:

list

panoptes_aggregation.reducers.slider_reducer.slider_reducer(votes_list)

Reduce a list of slider values into a mean, median, and variance

Parameters:

votes_list (list) – A list of slider values from process_data()

Returns:

reduction – A dictionary giving the mean, median, and variance of the slider values

Return type:

dict
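
A minimal sketch of the three summary statistics, using only the standard library. The helper name, the output key names, and the choice of population variance (rather than sample variance) are assumptions; the text above does not say which variance the reducer uses.

```python
import statistics


def summarize_slider(values):
    """Return mean, median and population variance of slider values."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Population variance (divides by N); an assumption for this sketch.
        "variance": statistics.pvariance(values),
    }


summary = summarize_slider([1, 2, 3, 4])
```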


Point Reducer

This module provides functions to cluster points extracted with panoptes_aggregation.extractors.point_extractor.

panoptes_aggregation.reducers.point_reducer.point_reducer(data_by_tool, **kwargs)

Cluster a list of points by tool using DBSCAN

This reducer is for use with panoptes_aggregation.extractors.point_extractor, which does not separate points by frame, and it does not support subtask reduction. Use panoptes_aggregation.extractors.point_extractor_by_frame and panoptes_aggregation.reducers.point_reducer_dbscan if there are multiple frames or subtasks.

Parameters:
Returns:

reduction – A dictionary with the following keys

  • tool*_points_x : A list of x positions for all points drawn with tool*

  • tool*_points_y : A list of y positions for all points drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all points drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The x position for each cluster found

  • tool*_clusters_y : The y position for each cluster found

  • tool*_clusters_var_x : The x variance of points in each cluster found

  • tool*_clusters_var_y : The y variance of points in each cluster found

  • tool*_clusters_var_x_y : The x-y covariance of points in each cluster found

Return type:

dict
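
To make the per-cluster keys concrete, the sketch below computes the count, position, variances and covariance for points that already carry cluster labels. It is illustrative only: the helper names are hypothetical, the clustering step itself is not shown, and the use of sample statistics (dividing by n - 1) is an assumption.

```python
from collections import defaultdict
from statistics import fmean


def sample_cov(us, vs):
    """Sample covariance of two equal-length sequences (None if n < 2)."""
    if len(us) < 2:
        return None
    mu, mv = fmean(us), fmean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / (len(us) - 1)


def summarize_clusters(points, labels):
    """Group (x, y) points by cluster label and summarize each cluster."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:  # DBSCAN conventionally labels noise points as -1
            groups[label].append(point)

    summary = {}
    for label, members in groups.items():
        xs = [x for x, _ in members]
        ys = [y for _, y in members]
        summary[label] = {
            "count": len(members),
            "x": fmean(xs),
            "y": fmean(ys),
            "var_x": sample_cov(xs, xs),
            "var_y": sample_cov(ys, ys),
            "var_x_y": sample_cov(xs, ys),
        }
    return summary


clusters = summarize_clusters([(0, 0), (2, 2), (10, 10)], [0, 0, -1])
```

The third point is labelled as noise, so only cluster 0 appears in the summary.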

panoptes_aggregation.reducers.point_reducer.process_data(data)

Process a list of extractions into lists of x and y sorted by tool.

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.point_extractor.point_extractor()

Returns:

processed_data – A dictionary with each key being a tool with a list of (x, y) tuples as a value

Return type:

dict


Point Reducer DBSCAN

This module provides functions to cluster points extracted with panoptes_aggregation.extractors.point_extractor.

panoptes_aggregation.reducers.point_reducer_dbscan.point_reducer_dbscan(data_by_tool, **kwargs)

Cluster a list of points by tool using DBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs – See DBSCAN

Returns:

reduction – A dictionary with one key per subject frame. Each frame has the following keys

  • tool*_points_x : A list of x positions for all points drawn with tool*

  • tool*_points_y : A list of y positions for all points drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all points drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The x position for each cluster found

  • tool*_clusters_y : The y position for each cluster found

  • tool*_clusters_var_x : The x variance of points in each cluster found

  • tool*_clusters_var_y : The y variance of points in each cluster found

  • tool*_clusters_var_x_y : The x-y covariance of points in each cluster found

Return type:

dict


Point Reducer HDBSCAN

This module provides functions to cluster points extracted with panoptes_aggregation.extractors.point_extractor.

panoptes_aggregation.reducers.point_reducer_hdbscan.point_reducer_hdbscan(data_by_tool, **kwargs)

Cluster a list of points by tool using HDBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs – See HDBSCAN

Returns:

reduction – A dictionary with one key per subject frame. Each frame has the following keys

  • tool*_points_x : A list of x positions for all points drawn with tool*

  • tool*_points_y : A list of y positions for all points drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all points drawn with tool*

  • tool*_cluster_probabilities: A list of cluster probabilities for all points drawn with tool*

  • tool*_clusters_persistance: A measure for how persistent each cluster is (1.0 = stable, 0.0 = unstable)

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The weighted x position for each cluster found

  • tool*_clusters_y : The weighted y position for each cluster found

  • tool*_clusters_var_x : The weighted x variance of points in each cluster found

  • tool*_clusters_var_y : The weighted y variance of points in each cluster found

  • tool*_clusters_var_x_y : The weighted x-y covariance of points in each cluster found

Return type:

dict
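
The "weighted" positions and variances above can be illustrated with a probability-weighted mean and variance. This is a sketch under assumptions: it treats each point's cluster probability as its weight and normalizes by the sum of weights, which may differ from the estimator the reducer actually uses. The helper name weighted_mean_var is hypothetical.

```python
def weighted_mean_var(values, weights):
    """Weighted mean and (biased) weighted variance of a 1-D sequence."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    # Biased estimator: squared deviations weighted and divided by sum(weights).
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


# Two x positions; the first point is three times as likely to be a member.
mean_x, var_x = weighted_mean_var([0.0, 4.0], [3.0, 1.0])
```

The weighted mean is pulled toward the higher-probability point (1.0 instead of the unweighted 2.0).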


Rectangle Reducer

This module provides functions to cluster rectangles extracted with panoptes_aggregation.extractors.rectangle_extractor.

panoptes_aggregation.reducers.rectangle_reducer.process_data(data)

Process a list of extractions into lists of x and y sorted by frame and tool

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.rectangle_extractor.rectangle_extractor()

Returns:

processed_data – A dictionary with one key per frame. Each value is a dictionary keyed by tool, with a list of (x, y, width, height) tuples as each value

Return type:

dict

panoptes_aggregation.reducers.rectangle_reducer.rectangle_reducer(data_by_tool, **kwargs)

Cluster a list of rectangles by tool and frame

Parameters:
Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_rec_x : A list of x positions for all rectangles drawn with tool*

  • tool*_rec_y : A list of y positions for all rectangles drawn with tool*

  • tool*_rec_width : A list of width values for all rectangles drawn with tool*

  • tool*_rec_height : A list of height values for all rectangles drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all rectangles drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The x position for each cluster found

  • tool*_clusters_y : The y position for each cluster found

  • tool*_clusters_width : The width value for each cluster found

  • tool*_clusters_height : The height value for each cluster found

Return type:

dict


Shape Reducer DBSCAN

This module provides functions to cluster shapes extracted with panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.shape_reducer_dbscan.shape_reducer_dbscan(data_by_tool, **kwargs)

Cluster a shape by tool using DBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimension shape parameter space or “IoU” for the intersection over union metric based on shape overlap. The IoU metric can only be used with the following shapes:

    • rectangle

    • rotateRectangle

    • circle

    • ellipse

  • kwargs – See DBSCAN

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_<shape>_<param> : A list of all param for the shape drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_<param> : The param value for each cluster found

If the “IoU” metric type is used there is also

  • tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric

Return type:

dict
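
For intuition about the “IoU” metric type, the sketch below computes intersection over union for two axis-aligned rectangles and turns it into a distance. It covers only the plain rectangle case; the helper names and the use of 1 - IoU as the distance are assumptions for illustration, not the library's implementation.

```python
def rectangle_iou(a, b):
    """IoU of two axis-aligned rectangles given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping region (zero if disjoint).
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def iou_distance(a, b):
    """Distance that is 0 for identical shapes and 1 for disjoint ones."""
    return 1.0 - rectangle_iou(a, b)


partial = rectangle_iou((0, 0, 2, 2), (1, 1, 2, 2))
```

The two 2x2 squares share a 1x1 corner, so the intersection is 1 and the union is 7.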


Shape Reducer OPTICS

This module provides functions to cluster shapes extracted with panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.shape_reducer_optics.shape_reducer_optics(data_by_tool, **kwargs)

Cluster a shape by tool using OPTICS

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimension shape parameter space or “IoU” for the intersection over union metric based on shape overlap. The IoU metric can only be used with the following shapes:

    • rectangle

    • rotateRectangle

    • circle

    • ellipse

  • kwargs – See OPTICS

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_<shape>_<param> : A list of all param for the shape drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_<param> : The param value for each cluster found

If the “IoU” metric type is used there is also

  • tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric

Return type:

dict


Shape Reducer HDBSCAN

This module provides functions to cluster shapes extracted with panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.shape_reducer_hdbscan.shape_reducer_hdbscan(data_by_tool, **kwargs)

Cluster a shape by tool using HDBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimension shape parameter space or “IoU” for the intersection over union metric based on shape overlap. The IoU metric can only be used with the following shapes:

    • rectangle

    • rotateRectangle

    • circle

    • ellipse

  • kwargs – See HDBSCAN

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_<shape>_<param> : A list of all param for the shape drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*

  • tool*_cluster_probabilities: A list of cluster probabilities for all points drawn with tool*

  • tool*_clusters_persistance: A measure for how persistent each cluster is (1.0 = stable, 0.0 = unstable)

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_<param> : The param value for each cluster found

If the “IoU” metric type is used there is also

  • tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric

Return type:

dict


Survey Reducer

This module provides functions to reduce survey task extracts from panoptes_aggregation.extractors.survey_extractor.

panoptes_aggregation.reducers.survey_reducer.process_data(data)

Process a list of extracted survey data into a dictionary of sub-question answers organized by choice

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.survey_extractor.survey_extractor()

Returns:

processed_data – A dictionary where the keys are the choice made and the values are a list of dicts containing Counters for each sub-question asked.

Return type:

dict

panoptes_aggregation.reducers.survey_reducer.survey_reducer(data_in)

Reduce the survey task answers as a list of dicts (one for each choice marked)

Parameters:

data_in (dict) – A dictionary created by process_data()

Returns:

reduction – A list that has one element for each choice marked. Each element is a dict of the form

  • choice : The choice made

  • total_vote_count : The number of users that classified the subject

  • choice_count : The number of users that made this choice

  • answers_* : Counters for each answer to sub-question *

Return type:

list


Polygon As Line Tool for Text Reducer

This module provides functions to reduce the polygon-text extractions from panoptes_aggregation.extractors.poly_line_text_extractor.

panoptes_aggregation.reducers.poly_line_text_reducer.poly_line_text_reducer(data_by_frame, **kwargs_dbscan)

Reduce the polygon-text answers as a list of lines of text.

Parameters:
  • data_by_frame (dict) – A dictionary returned by process_data()

  • kwargs

    • See DBSCAN

    • eps_slope : How close the angle of two lines need to be in order to be placed in the same angle cluster.

    • eps_line : How close vertically two lines need to be in order to be identified as the same line.

    • eps_word : How close horizontally the end points of a line need to be in order to be identified as a single point.

    • gutter_tol : How much neighboring columns can overlap horizontally and still be identified as multiple columns.

    • dot_freq : “line” if dots are drawn at the start and end point of a line, “word” if dots are drawn between each word. Note: “word” was proposed for a project but was never used, and it is not expected to be. This will likely be deprecated in a future release.

    • min_samples : For all clustering stages this is how many points need to be close together for a cluster to be identified. Set this to 1 for all annotations to be kept

    • min_word_count : The minimum number of times a word must be identified for it to be kept in the consensus text.

    • low_consensus_threshold : The minimum consensus score allowed to be considered “done”

    • minimum_views : A value that is passed along to the front-end to set when lines should turn grey (has no effect on aggregation)

Returns:

reduction – A dictionary with one key for each frame of the subject, each having a list as its value. Each item of the list represents one transcribed line of text and is a dictionary with these keys:

  • clusters_x : the x position of each identified word

  • clusters_y : the y position of each identified word

  • clusters_text : A list of text at each cluster position

  • gutter_label : A label indicating what “gutter” cluster the line is from

  • line_slope: The slope of the line of text in degrees

  • slope_label : A label indicating what slope cluster the line is from

  • number_views : The number of users that transcribed the line of text

  • consensus_score : The average number of users whose text agreed for the line. Note: if consensus_score is the same as number_views, every user agreed with each other

  • low_consensus : True if the consensus_score is less than the threshold set by the low_consensus_threshold keyword

For the entire subject the following is also returned:

  • low_consensus_lines : The number of lines with low consensus

  • transcribed_lines : The total number of lines transcribed on the subject

Note: the image coordinate system has y increasing downward.

Return type:

dict

panoptes_aggregation.reducers.poly_line_text_reducer.process_data(data_list, process_by_line=False)

Process a list of extractions into a dictionary of loc and text organized by frame

Parameters:

data_list (list) – A list of extractions created by panoptes_aggregation.extractors.poly_line_text_extractor.poly_line_text_extractor()

Returns:

processed_data – A dictionary with keys for each frame of the subject and values being dictionaries with x, y, text, and slope keys. x, y, and text are list-of-lists, where each inner list is from a single annotation; slope is the list of slopes (in deg) for each of these inner lists.

Return type:

dict


Text aggregation utilities

This module provides utility functions used in the polygon-as-line-text-reducer code from panoptes_aggregation.reducers.poly_line_text_reducer.

panoptes_aggregation.reducers.text_utils.align_words(word_line, xy_line, text_line, kwargs_cluster, kwargs_dbscan)

A function that takes the annotations for one line of text, aligns the words, and finds the end-points for the line.

Parameters:
  • word_line (np.array) – An nx1 array with the x-position of each dot in the rotated coordinate frame.

  • xy_line (np.array) – An nx2 array with the non-rotated (x, y) positions of each dot.

  • text_line (np.array) – An nx1 array with the text for each dot.

  • gs_line (np.array) – An array of bools indicating if the annotation was made in gold standard mode

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

  • clusters_x (list) – A list with the start and end x-position of the line

  • clusters_y (list) – A list with the start and end y-position of the line

  • clusters_text (list) – A list-of-lists with the words transcribed at each dot cluster found. One list per cluster. Note: the empty strings that were added to each annotation are stripped before returning the words.

panoptes_aggregation.reducers.text_utils.angle_metric(t1, t2)

A metric for the distance between angles in the [-180, 180] range

Parameters:
  • t1 (float) – Theta one in degrees

  • t2 (float) – Theta two in degrees

Returns:

distance – The distance between the two input angles in degrees

Return type:

float
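
A distance between angles has to account for wrap-around: 170 and -170 degrees are 20 degrees apart, not 340. The sketch below shows one common way to do this. It is an assumption about the general technique, not a copy of angle_metric, and the name angle_distance is hypothetical.

```python
def angle_distance(t1, t2):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(t1 - t2) % 360.0
    # Going the other way around the circle may be shorter.
    return min(diff, 360.0 - diff)


wrapped = angle_distance(170.0, -170.0)
plain = angle_distance(10.0, 30.0)
```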

panoptes_aggregation.reducers.text_utils.avg_angle(theta)

A function that finds the average of an array of angles that are in the range [-180, 180].

Parameters:

theta (array) – An array of angles that are in the range [-180, 180] degrees

Returns:

average – The average angle

Return type:

float
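
Averaging angles arithmetically fails near the ±180 boundary (the plain mean of 170 and -170 is 0, which points the wrong way). A standard remedy is the circular mean, sketched below. Whether avg_angle uses exactly this approach is an assumption, and the name circular_mean_deg is hypothetical.

```python
import math


def circular_mean_deg(angles):
    """Circular mean of angles in degrees, returned in [-180, 180]."""
    # Average the unit vectors for each angle, then convert back to an angle.
    sin_sum = sum(math.sin(math.radians(a)) for a in angles)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(sin_sum, cos_sum))


near_boundary = circular_mean_deg([170.0, -170.0])
ordinary = circular_mean_deg([10.0, 30.0])
```

For the first input the result lies at the ±180 boundary rather than at 0.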

panoptes_aggregation.reducers.text_utils.cluster_by_gutter(x_slope, y_slope, text_slope, gs_slope, data_index_slope, ext_index_slope, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for each frame of a subject and group them based on what side of the page gutter they are on.

Parameters:
  • x_slope (np.array) – A list-of-lists of the x values for each drawn dot. There is one item in the list for each annotation made by the user.

  • y_slope (np.array) – A list-of-lists of the y values for each drawn dot. There is one item in the list for each annotation made by the user.

  • text_slope (np.array) – A list-of-lists of the text for each drawn dot. There is one item in the list for each annotation made by the user.

  • gs_slope (np.array) – A list of bools indicating if the annotation was made in gold standard mode

  • data_index_slope (np.array) – A list of indices indicating what classification each annotation came from

  • ext_index_slope (np.array) – A list of extractor indices used to map the reduction to the extract

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

frame_gutter – A list of the resulting extractions, one item per line of text found.

Return type:

list

panoptes_aggregation.reducers.text_utils.cluster_by_line(xy_rotate, xy_gutter, text_gutter, annotation_labels, gs_gutter, data_index_gutter, ext_index_gutter, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for one slope_label and cluster them based on perpendicular distance (e.g. lines of text).

Parameters:
  • xy_rotate (np.array) – An array of shape nx2 containing the (x, y) positions of each dot drawn in the rotated coordinate frame.

  • xy_gutter (np.array) – An array of shape nx2 containing the (x, y) positions for each dot drawn.

  • text_gutter (np.array) – An array of shape nx1 containing the text for each dot drawn. Note: each annotation has an empty string added to the end so this array has the same shape as xy_slope.

  • annotation_labels (np.array) – An array of shape nx1 containing a unique label indicating what annotation each position/text came from. This information is used to ensure one annotation does not span multiple lines.

  • gs_gutter (np.array) – An array of bools indicating if the annotation was made in gold standard mode

  • data_index_gutter (np.array) – An array of indices indicating what classification each annotation came from

  • ext_index_gutter (np.array) – A list of extractor indices used to map the reduction to the extract

  • kwargs_cluster (dict) – A dictionary containing the eps_*, and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

frame_lines – A list of reductions, one for each line. Each reduction is a dictionary containing the information for the line.

Return type:

list

panoptes_aggregation.reducers.text_utils.cluster_by_slope(x_frame, y_frame, text_frame, slope_frame, gs_frame, data_index_frame, ext_index_frame, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for one gutter_label and cluster them based on what slope the transcription is.

Parameters:
  • x_frame (np.array) – A list-of-lists of the x values for each drawn dot. There is one item in the list for each annotation made by the user.

  • y_frame (np.array) – A list-of-lists of the y values for each drawn dot. There is one item in the list for each annotation made by the user.

  • text_frame (np.array) – A list-of-lists of the text for each drawn dot. There is one item in the list for each annotation made by the user. The inner text lists are padded with an empty string at the end so there is the same number of words as there are dots.

  • slope_frame (np.array) – A list of the slopes (in deg) for each annotation

  • gs_frame (np.array) – A list of bools indicating if the annotation was made in gold standard mode

  • data_index_frame (np.array) – A list of indices indicating what classification each annotation came from

  • ext_index_frame (np.array) – A list of extractor indices used to map the reduction to the extract

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

frame_slope – A list of the resulting extractions, one item per line of text found.

Return type:

list

panoptes_aggregation.reducers.text_utils.cluster_by_word(word_line, xy_line, text_line, annotation_labels, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for one line of text and cluster them based on the words in the line.

Parameters:
  • word_line (np.array) – An nx1 array with the x-position of each dot in the rotated coordinate frame.

  • xy_line (np.array) – An nx2 array with the non-rotated (x, y) positions of each dot.

  • text_line (np.array) – An nx1 array with the text for each dot.

  • annotation_labels (np.array) – An nx1 array with a label indicating what annotation each word belongs to.

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

  • clusters_x (list) – A list with the x-position of each dot cluster found

  • clusters_y (list) – A list with the y-position of each dot cluster found

  • clusters_text (list) – A list-of-lists with the words transcribed at each dot cluster found. One list per cluster. Note: the empty strings that were added to each annotation are stripped before returning the words.

panoptes_aggregation.reducers.text_utils.consensus_score(clusters_text)

A function to take clustered text data and return the consensus score

Parameters:

clusters_text (list) – A list-of-lists with length equal to the number of words in a line of text and each inner list contains the transcriptions for each word.

Returns:

  • consensus_score (float) – A value indicating the average number of users that agree on the line of text.

  • consensus_text (str) – A string with the consensus sentence
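
One plausible reading of "the average number of users that agree" is sketched below: take the most common transcription for each word position, then average the vote counts of those winning words. The tie-breaking, the skipping of empty strings, and the helper name line_consensus are assumptions, not confirmed details of consensus_score.

```python
from collections import Counter


def line_consensus(clusters_text):
    """Return (average agreeing votes per word, consensus sentence)."""
    words, votes = [], []
    for transcriptions in clusters_text:
        # Ignore empty strings so padding does not win a vote.
        counts = Counter(t for t in transcriptions if t != "")
        if not counts:
            continue
        word, count = counts.most_common(1)[0]
        words.append(word)
        votes.append(count)
    score = sum(votes) / len(votes) if votes else 0.0
    return score, " ".join(words)


score, text = line_consensus([["the", "the", "teh"], ["cat", "cat", "cat"]])
```

Two of three users agree on the first word and all three on the second, giving an average of 2.5.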

panoptes_aggregation.reducers.text_utils.gutter(lines_in, tol=0)

Cluster list of input line segments by what side of the page gutter they are on.

Parameters:

lines_in (list) – A list-of-lists containing one line segment per item. Each line segment should contain only the x-coordinate of each point on the line.

Returns:

gutter_index – A numpy array containing the cluster label for each input line. This label indicates what side of the gutter(s) the input line segment is on.

Return type:

array

panoptes_aggregation.reducers.text_utils.overlap(x, y, tol=0)

Check if two line segments overlap

Parameters:
  • x (list) – A list with the start and end point of the first line segment

  • y (list) – A list with the start and end point of the second line segment

  • tol (float) – The tolerance to consider lines overlapping. Default 0. Positive values indicate small overlaps are not considered, negative values indicate small gaps are not considered.

Returns:

overlap – True if the two line segments overlap, False otherwise

Return type:

bool
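
The effect of tol is easiest to see in a small sketch. The version below measures the shared length of two 1-D segments (negative when there is a gap) and compares it with tol. It assumes each segment is given as [start, end] with start <= end, and that segments which only touch do not count as overlapping; both are assumptions, as is the name segments_overlap.

```python
def segments_overlap(x, y, tol=0):
    """True if 1-D segments x and y overlap by more than tol."""
    # Positive: length of the shared region. Negative: size of the gap.
    shared = min(x[1], y[1]) - max(x[0], y[0])
    return shared > tol


small_overlap = segments_overlap([0, 5], [4, 10])
ignored_overlap = segments_overlap([0, 5], [4, 10], tol=2)
small_gap = segments_overlap([0, 5], [6, 10])
ignored_gap = segments_overlap([0, 5], [6, 10], tol=-2)
```

A positive tol makes the one-unit overlap too small to count; a negative tol makes the one-unit gap small enough to ignore.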

panoptes_aggregation.reducers.text_utils.sort_labels(db_labels, data, reducer=<function mean>, descending=False)

A function that takes in the cluster labels for some data and returns a list of the unique labels, sorted by the original data.

Parameters:
  • db_labels (list) – A list of cluster labels, one label for each data point.

  • data (np.array) – The data the labels belong to

  • reducer (function (optional)) – The function used to combine the data for each label. Default: np.mean

  • descending (bool (optional)) – A flag indicating if the labels should be sorted in descending order. Default: False

Returns:

labels – A list of unique cluster labels sorted in either ascending or descending order.

Return type:

list
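
The idea of ordering labels by a reduced value of their data can be sketched without NumPy. The helper below groups the data by label, applies the reducer to each group, and sorts the labels by the result. It does not treat any label (such as a noise label) specially, which may differ from sort_labels; the name order_labels is hypothetical.

```python
from statistics import fmean


def order_labels(labels, data, reducer=fmean, descending=False):
    """Return the unique labels sorted by reducer(data for that label)."""
    grouped = {}
    for label, value in zip(labels, data):
        grouped.setdefault(label, []).append(value)
    return sorted(grouped, key=lambda lab: reducer(grouped[lab]),
                  reverse=descending)


labels = [0, 0, 1, 1, 2]
data = [10, 12, 1, 3, 5]
ascending = order_labels(labels, data)
descending = order_labels(labels, data, descending=True)
```

The group means are 11, 2 and 5 for labels 0, 1 and 2, which determines the ordering.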

panoptes_aggregation.reducers.text_utils.tokenize(self, contents)

Tokenize only on space so angle bracket tags are not split


Shakespeare's World Variants Reducer

This module provides a function to reduce the variants data from extracts.

panoptes_aggregation.reducers.sw_variant_reducer.sw_variant_reducer(extracts)

Reduce all variants for a subject into one list

Parameters:

extracts (list) – A list of extracts created by panoptes_aggregation.extractors.sw_variant_extractor.sw_variant_extractor()

Returns:

reduction – A dictionary with at most one key, variants, holding the list of all variants in the subject

Return type:

dict


Dropdown Reducer

panoptes_aggregation.reducers.dropdown_reducer.dropdown_reducer(votes_list)

Reduce a list-of-lists of Counter objects into one list of dicts

Parameters:

votes_list (list) – A list-of-lists of Counter objects from process_data()

Returns:

reduction – A dictionary with one key, value, that contains a list of dictionaries (one for each dropdown in the task) giving the vote count for each key

Return type:

dict

panoptes_aggregation.reducers.dropdown_reducer.process_data(data)

Process a list of extracted dropdown answers into Counter objects

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.dropdown_extractor.dropdown_extractor()

Returns:

process_data – A list-of-lists of Counter objects. There is one element of the outer list for each classification made, and one element of the inner list for each dropdown list in the task.

Return type:

list
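
The two dropdown steps fit together as follows: each classification yields one Counter per dropdown, and the reducer sums the Counters position by position. The sketch below illustrates that summation only. The helper name combine_dropdown_votes and the handling of an empty input are assumptions; the single value key follows the return description above.

```python
from collections import Counter


def combine_dropdown_votes(votes_list):
    """Sum per-dropdown Counters across classifications."""
    if not votes_list:
        return {"value": []}
    # One running total per dropdown in the task.
    totals = [Counter() for _ in votes_list[0]]
    for classification in votes_list:
        for total, votes in zip(totals, classification):
            total.update(votes)
    return {"value": [dict(total) for total in totals]}


votes_list = [
    [Counter({"a": 1}), Counter({"x": 1})],
    [Counter({"a": 1}), Counter({"y": 1})],
]
reduction = combine_dropdown_votes(votes_list)
```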


TESS Column Reducer

This module provides functions to reduce the column task extracts for the TESS project. Extracts are from panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.tess_reducer_column.process_data(data, **kwargs_extra_data)

Process a list of extractions into lists of x and y sorted by tool

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.shape_extractor.shape_extractor()

Returns:

processed_data – A dictionary with two keys

  • data: An Nx2 numpy array containing the center and width of each column drawn

  • index: A list of length N indicating the extract index for each drawn column

Return type:

dict

panoptes_aggregation.reducers.tess_reducer_column.tess_reducer_column(data_by_tool, **kwargs)

Cluster TESS columns using DBSCAN

Parameters:
Returns:

reduction – A dictionary with the following keys

  • centers : A list with the center x position for all identified columns

  • widths : A list with the full width of all identified columns

  • counts : A list with the number of volunteers who identified each column

  • weighted_counts : A list with the weighted number of volunteers who identified each column

  • user_ids: A list of lists with the user_id for each volunteer who marked each column

  • max_weighted_counts: The largest likelihood of a transit for this subject

Return type:

dict


TESS Gold Standard Reducer

This module provides functions to reduce the gold standard task extracts for the TESS project.

panoptes_aggregation.reducers.tess_gold_standard_reducer.process_data(extracts)

Process the feedback extracts

Parameters:

extracts (list) – A list of extracts from Caesar’s pluck field extractor

Returns:

success – A list-of-lists, one list for each classification with booleans indicating the volunteer’s success at finding each gold standard transit in a subject.

Return type:

list

panoptes_aggregation.reducers.tess_gold_standard_reducer.tess_gold_standard_reducer(data)

Calculate the difficulty of a gold standard TESS subject

Parameters:

data (list) – The results of process_data()

Returns:

output – A dictionary with one key, difficulty, that is a list with the fraction of volunteers who successfully found each gold standard transit in a subject.

Return type:

dict
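
The difficulty value is a per-transit success fraction. Given the list-of-lists of booleans described for process_data(), it can be sketched as a column-wise average. The helper name transit_difficulty and the empty-input behaviour are assumptions.

```python
def transit_difficulty(success):
    """Fraction of classifications that found each gold standard transit."""
    if not success:
        return {"difficulty": []}
    n_classifications = len(success)
    # zip(*success) walks the data transit by transit (column-wise).
    return {
        "difficulty": [sum(found) / n_classifications
                       for found in zip(*success)]
    }


success = [
    [True, False],
    [True, True],
    [False, False],
    [True, False],
]
output = transit_difficulty(success)
```

Three of four volunteers found the first transit and one of four found the second.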


Utilities for optics_line_text_reducer

This module provides utilities used to reduce the polygon-text extractions for panoptes_aggregation.reducers.optics_line_text_reducer. It assumes that all extracts are full lines of text in the document.

panoptes_aggregation.reducers.optics_text_utils.cluster_of_one(X, data, user_ids, extract_index)

Create “clusters of one” out of the data passed in. Lines of text identified as noise are kept around as clusters of one so they can be displayed in the front-end to the next user.

Parameters:
  • X (list) – A nx2 list with each row containing [index mapping to data, index mapping to user]

  • data (list) – A list containing dictionaries with the original data that X maps to, of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}.

  • user_ids (list) – A list of user_ids (The second column of X maps to this list)

  • extract_index (list) – A list of n values with the extract index for each of the rows in X

Returns:

clusters – A list with n clusters each containing only one classification

Return type:

list

panoptes_aggregation.reducers.optics_text_utils.get_min_samples(N)

Get the min_samples attribute based on the number of users who have transcribed the subject. These values were found based on example data from ASM.

Parameters:

N (integer) – The number of users who have seen the subject

Returns:

min_samples – The value to use for the min_samples keyword in OPTICS

Return type:

integer

panoptes_aggregation.reducers.optics_text_utils.metric(a, b, data_in=[])

Calculate the distance between two drawn lines that have text associated with them. This distance is found by summing the Euclidean distance between the start points of each line, the Euclidean distance between the end points of each line, and the Levenshtein distance of the text for each line. The Levenshtein distance is calculated after stripping text tags and consolidating whitespace.

To use this metric within the clustering code without having to precompute the full distance matrix, a and b are index mappings to the data contained in data_in. a and b also contain the user information that is used to help prevent self-clustering.

Parameters:
  • a (list) – A two element list containing [index mapping to data, index mapping to user]

  • b (list) – A two element list containing [index mapping to data, index mapping to user]

  • data_in (list) – A list of dicts that take the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}. There is one element in this list for each classification made.

Returns:

distance – The distance between a and b

Return type:

float
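The sum described above can be sketched in plain Python as follows. The names levenshtein and metric_sketch are hypothetical. Returning 0 for identical data indices and an infinite distance for two lines from the same user are assumptions about how self-clustering is discouraged, and only whitespace is consolidated here (tag stripping is left out), so this is not the library's implementation.

```python
import math


def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def metric_sketch(a, b, data_in):
    """Distance between two transcribed lines referenced by index."""
    if a[0] == b[0]:
        return 0.0  # assumption: same data point
    if a[1] == b[1]:
        return math.inf  # assumption: a user never clusters with themselves
    line_a = data_in[int(a[0])]
    line_b = data_in[int(b[0])]
    # Euclidean distance between the start points and between the end points
    start = math.hypot(line_a['x'][0] - line_b['x'][0], line_a['y'][0] - line_b['y'][0])
    end = math.hypot(line_a['x'][1] - line_b['x'][1], line_a['y'][1] - line_b['y'][1])
    # Consolidate whitespace before comparing the text
    text_a = ' '.join(line_a['text'][0].split())
    text_b = ' '.join(line_b['text'][0].split())
    return start + end + levenshtein(text_a, text_b)


# Made-up data: two users transcribing roughly the same line
data = [
    {'x': [0, 100], 'y': [0, 0], 'text': ['hello world'], 'gold_standard': False},
    {'x': [3, 100], 'y': [4, 0], 'text': ['hello  word'], 'gold_standard': False},
]
print(metric_sketch([0, 0], [1, 1], data))  # 5.0 (start) + 0.0 (end) + 1 (text) = 6.0
```

Because a and b only carry indices, a clustering routine can call the function on rows of X and look the coordinates and text up in data_in on demand.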

panoptes_aggregation.reducers.optics_text_utils.order_lines(frame_in, angle_eps=30, gutter_eps=150)

Place the identified lines within a single frame in reading order

Parameters:
  • frame_in (list) – A list of identified transcribed lines (one frame from panoptes_aggregation.reducers.optics_line_text_reducer.optics_line_text_reducer)

  • angle_eps (float) – The DBSCAN eps value to use for the slope clustering

  • gutter_eps (float) – The DBSCAN eps value to use for the column clustering

Returns:

frame_ordered – The identified transcribed lines in reading order. The slope_label and gutter_label values are added to each line to indicate what cluster it belongs to.

Return type:

list

panoptes_aggregation.reducers.optics_text_utils.remove_user_duplication(labels_, core_distances_, users)

Make sure each user shows up in a cluster at most once. If a user does show up more than once in a cluster, the point with the smallest core distance is kept and all others are assigned as noise (-1).

Parameters:
  • labels_ (numpy.array) – A list containing the cluster labels for each data point

  • core_distances_ (numpy.array) – A list of core distances, one for each data point

  • users (numpy.array) – A list of indices that map to users, one for each data point

Returns:

clean_labels_ – A list containing the new cluster labels.

Return type:

numpy.array
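A minimal sketch of the rule described above is shown below. It uses plain Python lists instead of numpy arrays to stay dependency-free, and the name remove_user_duplication_sketch is hypothetical, so treat it as an illustration of the idea and not the library's code.

```python
def remove_user_duplication_sketch(labels, core_distances, users):
    """Keep at most one point per user in each cluster.

    When a user appears more than once in a cluster, the point with the
    smallest core distance keeps its label and the rest become noise (-1).
    """
    clean_labels = list(labels)
    best = {}  # (cluster label, user) -> index of the point kept so far
    for idx, (label, user) in enumerate(zip(labels, users)):
        if label == -1:
            continue  # already noise, nothing to do
        key = (label, user)
        if key not in best:
            best[key] = idx
        elif core_distances[idx] < core_distances[best[key]]:
            clean_labels[best[key]] = -1  # demote the previous best point
            best[key] = idx
        else:
            clean_labels[idx] = -1
    return clean_labels


# Made-up example: user 0 appears twice in cluster 0, user 2 twice in cluster 1
labels = [0, 0, 0, 1, 1, -1]
core_distances = [0.5, 0.2, 0.3, 0.1, 0.4, 0.9]
users = [0, 0, 1, 2, 2, 0]
print(remove_user_duplication_sketch(labels, core_distances, users))
# [-1, 0, 0, 1, -1, -1]
```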

panoptes_aggregation.reducers.optics_text_utils.strip_tags(s)

Remove square bracket tags from text and consolidate whitespace

Parameters:

s (string) – The input string

Returns:

clean_s – The cleaned string

Return type:

string
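One way to picture this cleaning step is the sketch below. The regular expression and the name strip_tags_sketch are assumptions: the sketch treats anything between a pair of square brackets as a tag and removes it, which may differ from the pattern the library actually uses.

```python
import re

# Assumption: a "tag" is any run of characters enclosed in square brackets
TAG_PATTERN = re.compile(r'\[[^\]]*\]')


def strip_tags_sketch(s):
    """Drop square bracket tags, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub('', s)
    # split() with no arguments splits on any whitespace and drops empties
    return ' '.join(without_tags.split())


print(strip_tags_sketch('The [unclear]quick[/unclear]   brown  fox'))
# The quick brown fox
```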


Line Tool with Text Subtask Reducer using OPTICS

This module provides functions to reduce the polygon-text extractions from panoptes_aggregation.extractors.poly_line_text_extractor using the density-independent clustering algorithm OPTICS. It is assumed that all extracts are full lines of text in the document.

panoptes_aggregation.reducers.optics_line_text_reducer.optics_line_text_reducer(data_by_frame, **kwargs_optics)

Reduce the line-text extracts as a list of lines of text.

Parameters:
  • data_by_frame (dict) – A dictionary returned by process_data()

  • kwargs

    • See OPTICS

    • min_samples : The smallest number of transcribed lines needed to form a cluster. auto will set this value based on the number of volunteers who transcribed on a page within a subject.

    • xi : Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.

    • angle_eps : How close the angles of two lines need to be in order to be placed in the same angle cluster. Note: This will only change the order of the lines.

    • gutter_eps : How close the x positions of the starts of two lines need to be in order to be placed in the same column cluster. Note: This will only change the order of the lines.

    • min_line_length : The minimum length a transcribed line of text needs to be in order to be used in the reduction.

    • low_consensus_threshold : The minimum consensus score allowed to be considered “done”.

    • minimum_views : A value that is passed along to the front-end to set when lines should turn grey (has no effect on aggregation)

Returns:

reduction – A dictionary with one key for each frame of the subject, each of which has a list as its value. Each item of the list represents one transcribed line of text and is a dictionary with these keys:

  • clusters_x : the x position of each identified word

  • clusters_y : the y position of each identified word

  • clusters_text : A list of lists containing the text at each cluster position. There is one list for each identified word, and each of those lists contains one item for each user that identified the cluster. If the user did not transcribe the word an empty string is used.

  • line_slope: The slope of the line of text in degrees

  • number_views : The number of users that transcribed the line of text

  • consensus_score : The average number of users whose text agreed for the line. Note: if consensus_score is the same as number_views, every user agreed with each other

  • user_ids: List of panoptes user ids in the same order as clusters_text

  • gold_standard: List of bools indicating whether each transcription was made in the front-end’s gold standard mode

  • slope_label: integer indicating what slope cluster the line belongs to

  • gutter_label: integer indicating what gutter cluster (i.e. column) the line belongs to

  • low_consensus : True if the consensus_score is less than the threshold set by the low_consensus_threshold keyword

For the entire subject the following is also returned:

  • low_consensus_lines : The number of lines with low consensus

  • transcribed_lines : The total number of lines transcribed on the subject

Note: the image coordinate system has y increasing downward.

Return type:

dict
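To make the shape of the returned dictionary concrete, the sketch below walks a hand-written reduction and collects the lines flagged as low consensus. The helper name, the frame key name frame0, and the example values are all made up for illustration, and each line dictionary is trimmed to the keys the helper reads.

```python
def low_consensus_lines_sketch(reduction):
    """Collect (frame, consensus_score, number_views) for low consensus lines."""
    flagged = []
    for key, value in reduction.items():
        if not isinstance(value, list):
            continue  # subject-level counts, not a frame
        for line in value:
            if line['low_consensus']:
                flagged.append((key, line['consensus_score'], line['number_views']))
    return flagged


# Hand-written example; 'frame0' is an assumed key name
example_reduction = {
    'frame0': [
        {'consensus_score': 3.0, 'number_views': 3, 'low_consensus': False},
        {'consensus_score': 1.5, 'number_views': 4, 'low_consensus': True},
    ],
    'low_consensus_lines': 1,
    'transcribed_lines': 2,
}
print(low_consensus_lines_sketch(example_reduction))  # [('frame0', 1.5, 4)]
```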

panoptes_aggregation.reducers.optics_line_text_reducer.process_data(data_list, min_line_length=0.0)

Process a list of extractions into a dictionary organized by frame

Parameters:

data_list (list) – A list of extractions created by panoptes_aggregation.extractors.poly_line_text_extractor.poly_line_text_extractor()

Returns:

processed_data – A dictionary with one key for each frame of the subject. The value for each key is a dictionary with two keys X and data. X is a 2D array with each row mapping to the data held in data. The first column contains row indices and the second column is an index assigned to each user. data is a list of dictionaries of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}.

Return type:

dict


Text Tool Reducer

This module provides functions to reduce the Panoptes text tool extracts into an alignment table.

panoptes_aggregation.reducers.text_reducer.process_data(data_list)

Flatten a list of extracts into a list of strings. Empty strings are not returned.

panoptes_aggregation.reducers.text_reducer.text_reducer(data_in, **kwargs)

Reduce a list of text into an alignment table

Parameters:

data (list) – A list of strings to be aligned

Returns:

reduction – A dictionary with the following keys:

  • aligned_text: A list of lists containing the aligned text. There is one list for each identified word, and each of those lists contains one item for each user that entered text. If the user did not transcribe a word an empty string is used.

  • number_views: Number of volunteers who entered non-blank text

  • consensus_score: The average number of users whose text agreed. Note: if consensus_score is the same as number_views, every user agreed with each other

Return type:

dict
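One plausible reading of "the average number of users whose text agreed" is sketched below: for each word position, count how many users entered the most common non-empty text, then average those counts over all positions. The helper name consensus_score_sketch is hypothetical and the library's exact calculation may differ.

```python
from collections import Counter


def consensus_score_sketch(aligned_text):
    """Average size of the largest agreeing group across word positions."""
    counts = []
    for word_column in aligned_text:
        # Empty strings mean the user did not transcribe this word
        entries = [word for word in word_column if word != '']
        if entries:
            counts.append(Counter(entries).most_common(1)[0][1])
    return sum(counts) / len(counts) if counts else 0.0


# Made-up alignment table: four word positions, three users
aligned_text = [
    ['the', 'the', 'the'],
    ['quick', 'quick', 'quik'],
    ['fox', 'fox', 'fox'],
    ['jumps', 'jumps', ''],
]
print(consensus_score_sketch(aligned_text))  # (3 + 2 + 3 + 2) / 4 = 2.5
```

With three users, a score of 3.0 would mean every user agreed on every word, which matches the note on consensus_score above.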


First N True Reducer

This module is designed to reduce boolean-valued extracts, e.g. those from panoptes_aggregation.extractors.all_tasks_empty_extractor. It returns True if and only if the first N extracts are True.

panoptes_aggregation.reducers.first_n_true_reducer.first_n_true_reducer(data_list, n=0, **kwargs)

Reduce a list of boolean values to a single boolean value.

Parameters:
  • data_list (list) – A list of dicts containing a “result” key which should correspond with a boolean value.

  • n (int) – The first n results in data_list must be True.

Returns:

reduction – reduction[“result”] is True if the first n results in data_list are True, otherwise False.

Return type:

dict
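The rule can be sketched in a few lines. The name first_n_true_sketch is hypothetical. How a list with fewer than n extracts is handled is not stated above, so the sketch assumes such a list yields False, and it does not try to model the n=0 default.

```python
def first_n_true_sketch(data_list, n):
    """Return {'result': True} only if the first n extracts are all True."""
    first = data_list[:n]
    # Assumption: fewer than n extracts cannot satisfy the condition
    result = len(first) == n and all(bool(extract['result']) for extract in first)
    return {'result': result}


# Made-up extracts
extracts = [{'result': True}, {'result': True}, {'result': False}]
print(first_n_true_sketch(extracts, n=2))  # {'result': True}
print(first_n_true_sketch(extracts, n=3))  # {'result': False}
```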