Reducers
Question Reducer
This module provides functions to reduce the question task extracts from
panoptes_aggregation.extractors.question_extractor
.
- panoptes_aggregation.reducers.question_reducer.question_reducer(data_list, pairs=False, track_user_ids=False, **kwargs)
Reduce a list of extracted questions into a “counter” dict
- Parameters:
data_list (list) – A list of extractions created by
panoptes_aggregation.extractors.question_extractor.question_extractor()
pairs (bool, optional) – Default False. How multiple choice questions are treated. When True the set of all choices is treated as a single answer
track_user_ids (bool, optional) – Default False. Set to True to also track the user_ids that gave each answer.
- Returns:
reduction – A dictionary (formatted as a Counter) giving the vote count for each key. If track_user_ids is True it will also contain a list of user_ids for each answer given.
- Return type:
dict
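The counting behaviour described above can be sketched with the standard library. This is an illustrative sketch, not the library's implementation; in particular, joining a set of paired choices with "+" is an assumption about the key format:

```python
from collections import Counter

def question_reducer_sketch(data_list, pairs=False):
    """Reduce extracted question answers into a vote-count dict.

    NOTE: joining selected choices with "+" when pairs=True is an
    illustrative assumption, not the library's confirmed key format.
    """
    counts = Counter()
    for extraction in data_list:
        if pairs:
            # Treat the full set of selected choices as one answer.
            counts['+'.join(sorted(extraction))] += 1
        else:
            counts.update(extraction)
    return dict(counts)

print(question_reducer_sketch([{'yes': 1}, {'yes': 1}, {'no': 1}]))
# {'yes': 2, 'no': 1}
```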
Question Consensus Reducer
This module provides functions to reduce the question task extracts from
panoptes_aggregation.extractors.question_extractor
.
- panoptes_aggregation.reducers.question_consensus_reducer.question_consensus_reducer(data_list, pairs=False, **kwargs)
Reduce a list of extracted questions into a consensus description dict
- Parameters:
data_list (list) – A list of extractions created by
panoptes_aggregation.extractors.question_extractor.question_extractor()
pairs (bool, optional) – Default False. How multiple choice questions are treated. When True the set of all choices is treated as a single answer
- Returns:
reduction – A dictionary with the following keys
most_likely : the key with the greatest number of classifications/votes
num_votes : the vote count for the most likely key
agreement : the fraction of total votes held by the most likely key
- Return type:
dict
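The three consensus keys can be sketched directly from vote counts; a minimal sketch, not the library's implementation:

```python
from collections import Counter

def question_consensus_sketch(data_list):
    """Return the consensus keys described above: most_likely,
    num_votes, and agreement. A sketch, not the library code."""
    counts = Counter()
    for extraction in data_list:
        counts.update(extraction)
    most_likely, num_votes = counts.most_common(1)[0]
    return {
        'most_likely': most_likely,
        'num_votes': num_votes,
        'agreement': num_votes / sum(counts.values()),
    }
```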
Slider Reducer
This module provides functions to reduce the slider task extracts from
panoptes_aggregation.extractors.slider_extractor
.
- panoptes_aggregation.reducers.slider_reducer.process_data(data, pairs=False)
Process a list of extracted sliders into a list
- Parameters:
data (list) – A list of extractions created by
panoptes_aggregation.extractors.slider_extractor.slider_extractor()
- Returns:
processed_data – A list of slider values, one for each extraction
- Return type:
list
- panoptes_aggregation.reducers.slider_reducer.slider_reducer(votes_list)
Reduce a list of slider values into a mean and median
- Parameters:
votes_list (list) – A list of slider values from
process_data()
- Returns:
reduction – A dictionary giving the mean, median, and variance of the slider values
- Return type:
dict
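The statistics described above can be sketched with the standard library. The key names and the use of population variance are assumptions for illustration; the library's exact choices may differ:

```python
import statistics

def slider_reducer_sketch(votes_list):
    """Return mean, median, and variance of slider values.

    NOTE: key names and population (vs sample) variance are
    assumptions, not the library's confirmed output format.
    """
    return {
        'mean': statistics.mean(votes_list),
        'median': statistics.median(votes_list),
        'var': statistics.pvariance(votes_list),
    }
```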
Point Reducer
This module provides functions to cluster points extracted with
panoptes_aggregation.extractors.point_extractor
.
- panoptes_aggregation.reducers.point_reducer.point_reducer(data_by_tool, **kwargs)
Cluster a list of points by tool using DBSCAN
This reducer is for use with
panoptes_aggregation.extractors.point_extractor
that does not separate points by frame and does not support subtask reduction. Use
panoptes_aggregation.extractors.point_extractor_by_frame
and
panoptes_aggregation.reducers.point_reducer_dbscan
if there are multiple frames or subtasks.
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
kwargs – See DBSCAN
- Returns:
reduction – A dictionary with the following keys
tool*_points_x : A list of x positions for all points drawn with tool*
tool*_points_y : A list of y positions for all points drawn with tool*
tool*_cluster_labels : A list of cluster labels for all points drawn with tool*
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_x : The x position for each cluster found
tool*_clusters_y : The y position for each cluster found
tool*_clusters_var_x : The x variance of points in each cluster found
tool*_clusters_var_y : The y variance of points in each cluster found
tool*_clusters_var_x_y : The x-y covariance of points in each cluster found
- Return type:
dict
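The per-cluster outputs listed above can be derived from per-point cluster labels. This sketch assumes labels have already been assigned (e.g. by scikit-learn's DBSCAN, with -1 marking noise) and drops the tool* prefix for brevity; sample variance is shown, though which variance the library reports is not stated here:

```python
import statistics

def cluster_stats_sketch(points_x, points_y, labels):
    """Derive per-cluster count, mean position, and variances from
    DBSCAN-style labels (-1 = noise). A sketch, not library code."""
    out = {
        'points_x': points_x,
        'points_y': points_y,
        'cluster_labels': labels,
        'clusters_count': [],
        'clusters_x': [],
        'clusters_y': [],
        'clusters_var_x': [],
        'clusters_var_y': [],
    }
    for label in sorted(set(labels) - {-1}):
        xs = [x for x, lab in zip(points_x, labels) if lab == label]
        ys = [y for y, lab in zip(points_y, labels) if lab == label]
        out['clusters_count'].append(len(xs))
        out['clusters_x'].append(statistics.mean(xs))
        out['clusters_y'].append(statistics.mean(ys))
        out['clusters_var_x'].append(statistics.variance(xs))
        out['clusters_var_y'].append(statistics.variance(ys))
    return out
```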
- panoptes_aggregation.reducers.point_reducer.process_data(data)
Process a list of extractions into lists of x and y sorted by tool.
- Parameters:
data (list) – A list of extractions created by
panoptes_aggregation.extractors.point_extractor.point_extractor()
- Returns:
processed_data – A dictionary with one key per tool; each value is a list of (x, y) tuples
- Return type:
dict
Point Reducer DBSCAN
This module provides functions to cluster points extracted with
panoptes_aggregation.extractors.point_extractor
.
- panoptes_aggregation.reducers.point_reducer_dbscan.point_reducer_dbscan(data_by_tool, **kwargs)
Cluster a list of points by tool using DBSCAN
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
kwargs – See DBSCAN
- Returns:
reduction – A dictionary with one key per subject frame. Each frame has the following keys
tool*_points_x : A list of x positions for all points drawn with tool*
tool*_points_y : A list of y positions for all points drawn with tool*
tool*_cluster_labels : A list of cluster labels for all points drawn with tool*
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_x : The x position for each cluster found
tool*_clusters_y : The y position for each cluster found
tool*_clusters_var_x : The x variance of points in each cluster found
tool*_clusters_var_y : The y variance of points in each cluster found
tool*_clusters_var_x_y : The x-y covariance of points in each cluster found
- Return type:
dict
Point Reducer HDBSCAN
This module provides functions to cluster points extracted with
panoptes_aggregation.extractors.point_extractor
.
- panoptes_aggregation.reducers.point_reducer_hdbscan.point_reducer_hdbscan(data_by_tool, **kwargs)
Cluster a list of points by tool using HDBSCAN
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
kwargs – See HDBSCAN
- Returns:
reduction – A dictionary with one key per subject frame. Each frame has the following keys
tool*_points_x : A list of x positions for all points drawn with tool*
tool*_points_y : A list of y positions for all points drawn with tool*
tool*_cluster_labels : A list of cluster labels for all points drawn with tool*
tool*_cluster_probabilities : A list of cluster probabilities for all points drawn with tool*
tool*_clusters_persistance : A measure of how persistent each cluster is (1.0 = stable, 0.0 = unstable)
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_x : The weighted x position for each cluster found
tool*_clusters_y : The weighted y position for each cluster found
tool*_clusters_var_x : The weighted x variance of points in each cluster found
tool*_clusters_var_y : The weighted y variance of points in each cluster found
tool*_clusters_var_x_y : The weighted x-y covariance of points in each cluster found
- Return type:
dict
Rectangle Reducer
This module provides functions to cluster rectangles extracted with
panoptes_aggregation.extractors.rectangle_extractor
.
- panoptes_aggregation.reducers.rectangle_reducer.process_data(data)
Process a list of extractions into lists of x and y sorted by frame and tool
- Parameters:
data (list) – A list of extractions created by
panoptes_aggregation.extractors.rectangle_extractor.rectangle_extractor()
- Returns:
processed_data – A dictionary with one key per frame; each value is a dictionary with one key per tool whose value is a list of (x, y, width, height) tuples
- Return type:
dict
- panoptes_aggregation.reducers.rectangle_reducer.rectangle_reducer(data_by_tool, **kwargs)
Cluster a list of rectangles by tool and frame
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
kwargs – See DBSCAN
- Returns:
reduction – A dictionary with the following keys for each frame
tool*_rec_x : A list of x positions for all rectangles drawn with tool*
tool*_rec_y : A list of y positions for all rectangles drawn with tool*
tool*_rec_width : A list of width values for all rectangles drawn with tool*
tool*_rec_height : A list of height values for all rectangles drawn with tool*
tool*_cluster_labels : A list of cluster labels for all rectangles drawn with tool*
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_x : The x position for each cluster found
tool*_clusters_y : The y position for each cluster found
tool*_clusters_width : The width value for each cluster found
tool*_clusters_height : The height value for each cluster found
- Return type:
dict
Shape Reducer DBSCAN
This module provides functions to cluster shapes extracted with
panoptes_aggregation.extractors.shape_extractor
.
- panoptes_aggregation.reducers.shape_reducer_dbscan.shape_reducer_dbscan(data_by_tool, **kwargs)
Cluster a list of shapes by tool using DBSCAN
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimensional shape parameter space or “IoU” for the intersection over union metric based on shape overlap. The IoU metric can only be used with the following shapes:
rectangle
rotateRectangle
circle
ellipse
kwargs – See DBSCAN
- Returns:
reduction – A dictionary with the following keys for each frame
tool*_<shape>_<param> : A list of all param for the shape drawn with tool*
tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_<param> : The param value for each cluster found
If the “IoU” metric type is used there is also
tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric
- Return type:
dict
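The “IoU” metric mentioned above is the intersection over union of the shape areas. For axis-aligned rectangles it can be sketched as follows (turning this into a clustering distance, e.g. 1 - IoU, is a convention assumed here, not confirmed by the source):

```python
def rectangle_iou(a, b):
    """Intersection over union of two axis-aligned rectangles.

    Rectangles are (x, y, width, height) tuples with (x, y) one
    corner. A sketch of the geometry, not the library code.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis (zero if the rectangles are disjoint).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = ix * iy
    union = aw * ah + bw * bh - intersection
    return intersection / union

print(rectangle_iou((0, 0, 2, 2), (1, 0, 2, 2)))  # 0.3333333333333333
```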
Shape Reducer OPTICS
This module provides functions to cluster shapes extracted with
panoptes_aggregation.extractors.shape_extractor
.
- panoptes_aggregation.reducers.shape_reducer_optics.shape_reducer_optics(data_by_tool, **kwargs)
Cluster a list of shapes by tool using OPTICS
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimensional shape parameter space or “IoU” for the intersection over union metric based on shape overlap. The IoU metric can only be used with the following shapes:
rectangle
rotateRectangle
circle
ellipse
kwargs – See OPTICS
- Returns:
reduction – A dictionary with the following keys for each frame
tool*_<shape>_<param> : A list of all param for the shape drawn with tool*
tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_<param> : The param value for each cluster found
If the “IoU” metric type is used there is also
tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric
- Return type:
dict
Shape Reducer HDBSCAN
This module provides functions to cluster shapes extracted with
panoptes_aggregation.extractors.shape_extractor
.
- panoptes_aggregation.reducers.shape_reducer_hdbscan.shape_reducer_hdbscan(data_by_tool, **kwargs)
Cluster a list of shapes by tool using HDBSCAN
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimensional shape parameter space or “IoU” for the intersection over union metric based on shape overlap. The IoU metric can only be used with the following shapes:
rectangle
rotateRectangle
circle
ellipse
kwargs – See HDBSCAN
- Returns:
reduction – A dictionary with the following keys for each frame
tool*_<shape>_<param> : A list of all param for the shape drawn with tool*
tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*
tool*_cluster_probabilities : A list of cluster probabilities for all points drawn with tool*
tool*_clusters_persistance : A measure of how persistent each cluster is (1.0 = stable, 0.0 = unstable)
tool*_clusters_count : The number of points in each cluster found
tool*_clusters_<param> : The param value for each cluster found
If the “IoU” metric type is used there is also
tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric
- Return type:
dict
Survey Reducer
This module provides functions to reduce survey task extracts from
panoptes_aggregation.extractors.survey_extractor
.
- panoptes_aggregation.reducers.survey_reducer.process_data(data)
Process a list of extracted survey data into a dictionary of sub-question answers organized by choice
- Parameters:
data (list) – A list of extractions created by
panoptes_aggregation.extractors.survey_extractor.survey_extractor()
- Returns:
processed_data – A dictionary where the keys are the choice made and the values are a list of dicts containing Counters for each sub-question asked.
- Return type:
dict
- panoptes_aggregation.reducers.survey_reducer.survey_reducer(data_in)
Reduce the survey task answers as a list of dicts (one for each choice marked)
- Parameters:
data_in (dict) – A dictionary created by
process_data()
- Returns:
reduction – A list that has one element for each choice marked. Each element is a dict of the form
choice : The choice made
total_vote_count : The number of users that classified the subject
choice_count : The number of users that made this choice
answers_* : Counters for each answer to sub-question *
- Return type:
list
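The per-choice output described above can be sketched with the standard library. This is a sketch, not the library's implementation: counting every choice marking toward total_vote_count is a simplification (the library counts classifying users, and a user can mark several choices):

```python
from collections import Counter

def survey_reducer_sketch(data_in):
    """Build one reduction dict per choice marked.

    data_in maps each choice to a list of per-classification dicts of
    Counters (one Counter per sub-question), as produced by a
    process_data step. A sketch, not the library implementation.
    """
    # Simplification: total markings, not distinct classifying users.
    total = sum(len(marks) for marks in data_in.values())
    reduction = []
    for choice, answer_dicts in data_in.items():
        merged = {}
        for answers in answer_dicts:
            for sub_question, votes in answers.items():
                merged.setdefault('answers_' + sub_question, Counter()).update(votes)
        entry = {
            'choice': choice,
            'total_vote_count': total,
            'choice_count': len(answer_dicts),
        }
        entry.update({k: dict(v) for k, v in merged.items()})
        reduction.append(entry)
    return reduction
```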
Polygon As Line Tool for Text Reducer
This module provides functions to reduce the polygon-text extractions from
panoptes_aggregation.extractors.poly_line_text_extractor
.
- panoptes_aggregation.reducers.poly_line_text_reducer.poly_line_text_reducer(data_by_frame, **kwargs_dbscan)
Reduce the polygon-text answers as a list of lines of text.
- Parameters:
data_by_frame (dict) – A dictionary returned by
process_data()
kwargs –
eps_slope : How close the angle of two lines need to be in order to be placed in the same angle cluster.
eps_line : How close vertically two lines need to be in order to be identified as the same line.
eps_word : How close horizontally the end points of a line need to be in order to be identified as a single point.
gutter_tol : How much neighboring columns can overlap horizontally and still be identified as multiple columns.
dot_freq : “line” if dots are drawn at the start and end point of a line, “word” if dots are drawn between each word. Note: “word” was proposed for a project but was never used and likely never will be. This option will likely be deprecated in a future release.
min_samples : For all clustering stages this is how many points need to be close together for a cluster to be identified. Set this to 1 for all annotations to be kept
min_word_count : The minimum number of times a word must be identified for it to be kept in the consensus text.
low_consensus_threshold : The minimum consensus score allowed to be considered “done”
minimum_views : A value that is passed along to the front-end to set when lines should turn grey (has no effect on aggregation)
- Returns:
reduction – A dictionary with one key for each frame of the subject; the values are lists. Each item of a list represents one transcribed line of text and is a dictionary with these keys:
clusters_x : the x position of each identified word
clusters_y : the y position of each identified word
clusters_text : A list of text at each cluster position
gutter_label : A label indicating what “gutter” cluster the line is from
line_slope: The slope of the line of text in degrees
slope_label : A label indicating what slope cluster the line is from
number_views : The number of users that transcribed the line of text
consensus_score : The average number of users whose text agreed for the line. Note: if consensus_score is the same as number_views, every user agreed with each other
low_consensus : True if the consensus_score is less than the threshold set by the low_consensus_threshold keyword
For the entire subject the following are also returned:
low_consensus_lines : The number of lines with low consensus
transcribed_lines : The total number of lines transcribed on the subject
Note: the image coordinate system has y increasing downward.
- Return type:
dict
- panoptes_aggregation.reducers.poly_line_text_reducer.process_data(data_list, process_by_line=False)
Process a list of extractions into a dictionary of loc and text organized by frame
- Parameters:
data_list (list) – A list of extractions created by
panoptes_aggregation.extractors.poly_line_text_extractor.poly_line_text_extractor()
- Returns:
processed_data – A dictionary with keys for each frame of the subject and values being dictionaries with x, y, text, and slope keys. x, y, and text are list-of-lists; each inner list is from a single annotation, and slope is the list of slopes (in deg) for each of these inner lists.
- Return type:
dict
Text aggregation utilities
This module provides utility functions used in the polygon-as-line-text-reducer code from
panoptes_aggregation.reducers.poly_line_text_reducer
.
- panoptes_aggregation.reducers.text_utils.align_words(word_line, xy_line, text_line, kwargs_cluster, kwargs_dbscan)
A function to take the annotations for one line of text, aligns the words, and finds the end-points for the line.
- Parameters:
word_line (np.array) – An nx1 array with the x-position of each dot in the rotated coordinate frame.
xy_line (np.array) – An nx2 array with the non-rotated (x, y) positions of each dot.
text_line (np.array) – An nx1 array with the text for each dot.
gs_line (np.array) – An array of bools indicating if the annotation was made in gold standard mode
kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords
kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords
- Returns:
clusters_x (list) – A list with the start and end x-position of the line
clusters_y (list) – A list with the start and end y-position of the line
clusters_text (list) – A list-of-lists with the words transcribed at each dot cluster found. One list per cluster. Note: the empty strings that were added to each annotation are stripped before returning the words.
- panoptes_aggregation.reducers.text_utils.angle_metric(t1, t2)
A metric for the distance between angles in the [-180, 180] range
- Parameters:
t1 (float) – Theta one in degrees
t2 (float) – Theta two in degrees
- Returns:
distance – The distance between the two input angles in degrees
- Return type:
float
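The wrap-around distance described above can be sketched in a couple of lines:

```python
def angle_metric_sketch(t1, t2):
    """Wrap-around distance between two angles in degrees.

    A sketch of the behaviour described above: angles 20 degrees
    apart across the +/-180 boundary have distance 20, not 340.
    """
    d = abs(t1 - t2) % 360
    return min(d, 360 - d)

print(angle_metric_sketch(170, -170))  # 20
```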
- panoptes_aggregation.reducers.text_utils.avg_angle(theta)
A function that finds the average of an array of angles that are in the range [-180, 180].
- Parameters:
theta (array) – An array of angles that are in the range [-180, 180] degrees
- Returns:
average – The average angle
- Return type:
float
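A plain arithmetic mean fails at the ±180 wrap (mean(170, -170) should be ±180, not 0). A standard approach, assumed here rather than confirmed from the library, is the circular mean of unit vectors:

```python
import math

def avg_angle_sketch(theta):
    """Circular mean of angles in [-180, 180] degrees.

    Averaging unit vectors is an assumption about the method; it
    avoids the wrap-around error a plain mean has at +/-180.
    """
    s = sum(math.sin(math.radians(t)) for t in theta)
    c = sum(math.cos(math.radians(t)) for t in theta)
    return math.degrees(math.atan2(s, c))
```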
- panoptes_aggregation.reducers.text_utils.cluster_by_gutter(x_slope, y_slope, text_slope, gs_slope, data_index_slope, ext_index_slope, kwargs_cluster, kwargs_dbscan)
A function to take the annotations for each frame of a subject and group them based on what side of the page gutter they are on.
- Parameters:
x_slope (np.array) – A list-of-lists of the x values for each drawn dot. There is one item in the list for each annotation made by the user.
y_slope (np.array) – A list-of-lists of the y values for each drawn dot. There is one item in the list for each annotation made by the user.
text_slope (np.array) – A list-of-lists of the text for each drawn dot. There is one item in the list for each annotation made by the user.
gs_slope (np.array) – A list of bools indicating if the annotation was made in gold standard mode
data_index_slope (np.array) – A list of indices indicating what classification each annotation came from
ext_index_slope (np.array) – A list of extractor indices used to map the reduction to the extract
kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords
kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords
- Returns:
frame_gutter – A list of the resulting extractions, one item per line of text found.
- Return type:
list
- panoptes_aggregation.reducers.text_utils.cluster_by_line(xy_rotate, xy_gutter, text_gutter, annotation_labels, gs_gutter, data_index_gutter, ext_index_gutter, kwargs_cluster, kwargs_dbscan)
A function to take the annotations for one slope_label and cluster them based on perpendicular distance (e.g. lines of text).
- Parameters:
xy_rotate (np.array) – An array of shape nx2 containing the (x, y) positions of each dot drawn in the rotate coordinate frame.
xy_gutter (np.array) – An array of shape nx2 containing the (x, y) positions for each dot drawn.
text_gutter (np.array) – An array of shape nx1 containing the text for each dot drawn. Note: each annotation has an empty string added to the end so this array has the same length as xy_gutter.
annotation_labels (np.array) – An array of shape nx1 containing a unique label indicating what annotation each position/text came from. This information is used to ensure one annotation does not span multiple lines.
gs_gutter (np.array) – An array of bools indicating if the annotation was made in gold standard mode
data_index_gutter (np.array) – An array of indices indicating what classification each annotation came from
ext_index_gutter (np.array) – A list of extractor indices used to map the reduction to the extract
kwargs_cluster (dict) – A dictionary containing the eps_*, and dot_freq keywords
kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords
- Returns:
frame_lines – A list of reductions, one for each line. Each reduction is a dictionary containing the information for the line.
- Return type:
list
- panoptes_aggregation.reducers.text_utils.cluster_by_slope(x_frame, y_frame, text_frame, slope_frame, gs_frame, data_index_frame, ext_index_frame, kwargs_cluster, kwargs_dbscan)
A function to take the annotations for one gutter_label and cluster them based on the slope of the transcription.
- Parameters:
x_frame (np.array) – A list-of-lists of the x values for each drawn dot. There is one item in the list for each annotation made by the user.
y_frame (np.array) – A list-of-lists of the y values for each drawn dot. There is one item in the list for each annotation made by the user.
text_frame (np.array) – A list-of-lists of the text for each drawn dot. There is one item in the list for each annotation made by the user. The inner text lists are padded with an empty string at the end so there is the same number of words as there are dots.
slope_frame (np.array) – A list of the slopes (in deg) for each annotation
gs_frame (np.array) – A list of bools indicating if the annotation was made in gold standard mode
data_index_frame (np.array) – A list of indices indicating what classification each annotation came from
ext_index_frame (np.array) – A list of extractor indices used to map the reduction to the extract
kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords
kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords
- Returns:
frame_slope – A list of the resulting extractions, one item per line of text found.
- Return type:
list
- panoptes_aggregation.reducers.text_utils.cluster_by_word(word_line, xy_line, text_line, annotation_labels, kwargs_cluster, kwargs_dbscan)
A function to take the annotations for one line of text and cluster them based on the words in the line.
- Parameters:
word_line (np.array) – An nx1 array with the x-position of each dot in the rotated coordinate frame.
xy_line (np.array) – An nx2 array with the non-rotated (x, y) positions of each dot.
text_line (np.array) – An nx1 array with the text for each dot.
annotation_labels (np.array) – An nx1 array with a label indicating what annotation each word belongs to.
kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords
kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords
- Returns:
clusters_x (list) – A list with the x-position of each dot cluster found
clusters_y (list) – A list with the y-position of each dot cluster found
clusters_text (list) – A list-of-lists with the words transcribed at each dot cluster found. One list per cluster. Note: the empty strings that were added to each annotation are stripped before returning the words.
- panoptes_aggregation.reducers.text_utils.consensus_score(clusters_text)
A function to take clustered text data and return the consensus score
- Parameters:
clusters_text (list) – A list-of-lists with length equal to the number of words in a line of text and each inner list contains the transcriptions for each word.
- Returns:
consensus_score (float) – A value indicating the average number of users that agree on the line of text.
consensus_text (str) – A string with the consensus sentence
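The score can be sketched as the mean, over word positions, of the count of the most common transcription at each position; this interpretation is an assumption consistent with the description above, not the library's confirmed formula:

```python
from collections import Counter

def consensus_score_sketch(clusters_text):
    """Score a clustered line of text and build its consensus
    sentence. A sketch, not the library implementation."""
    # Most common word and its count at each word position.
    top = [Counter(words).most_common(1)[0] for words in clusters_text]
    score = sum(count for _, count in top) / len(top)
    text = ' '.join(word for word, _ in top)
    return score, text

print(consensus_score_sketch([['the', 'the', 'the'], ['cat', 'cat', 'bat']]))
# (2.5, 'the cat')
```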
- panoptes_aggregation.reducers.text_utils.gutter(lines_in, tol=0)
Cluster list of input line segments by what side of the page gutter they are on.
- Parameters:
lines_in (list) – A list-of-lists containing one line segment per item. Each line segment should contain only the x-coordinate of each point on the line.
- Returns:
gutter_index – A numpy array containing the cluster label for each input line. This label indicates what side of the gutter(s) the input line segment is on.
- Return type:
array
- panoptes_aggregation.reducers.text_utils.overlap(x, y, tol=0)
Check if two line segments overlap
- Parameters:
x (list) – A list with the start and end point of the first line segment
y (list) – A list with the start and end point of the second line segment
tol (float) – The tolerance to consider lines overlapping. Default 0; positive values indicate small overlaps are not considered, negative values indicate small gaps are not considered.
- Returns:
overlap – True if the two line segments overlap, False otherwise
- Return type:
bool
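The overlap test reduces to comparing the signed overlap amount against tol; a minimal sketch of the behaviour described above (the exact comparison the library uses, strict or not, is an assumption):

```python
def overlap_sketch(x, y, tol=0):
    """True if segments x=[x0, x1] and y=[y0, y1] overlap.

    The signed overlap is min(ends) - max(starts): positive tol
    ignores small overlaps, negative tol ignores small gaps.
    """
    return min(x[1], y[1]) - max(x[0], y[0]) > tol
```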
- panoptes_aggregation.reducers.text_utils.sort_labels(db_labels, data, reducer=<function mean>, descending=False)
A function that takes in the cluster labels for some data and returns a list of the unique labels sorted by the reduced data values.
- Parameters:
db_labels (list) – A list of cluster labels, one label for each data point.
data (np.array) – The data the labels belong to
reducer (function (optional)) – The function used to combine the data for each label. Default: np.mean
descending (bool (optional)) – A flag indicating if the labels should be sorted in descending order. Default: False
- Returns:
labels – A list of unique cluster labels sorted in either ascending or descending order.
- Return type:
list
- panoptes_aggregation.reducers.text_utils.tokenize(self, contents)
Tokenize only on space so angle bracket tags are not split
Shakespeares World Variants Reducer
This module provides a function to reduce the variants data from extracts.
- panoptes_aggregation.reducers.sw_variant_reducer.sw_variant_reducer(extracts)
Reduce all variants for a subject into one list
- Parameters:
extracts (list) – A list of extracts created by
panoptes_aggregation.extractors.sw_variant_extractor.sw_variant_extractor()
- Returns:
reduction – A dictionary with at most one key, variants, containing the list of all variants in the subject
- Return type:
dict
Dropdown Reducer
This module provides functions to reduce the dropdown task extracts from
panoptes_aggregation.extractors.dropdown_extractor
.
- panoptes_aggregation.reducers.dropdown_reducer.dropdown_reducer(votes_list)
Reduce a list-of-lists of Counter objects into one list of dicts
- Parameters:
votes_list (list) – A list-of-lists of Counter objects from
process_data()
- Returns:
reduction – A dictionary with one key whose value is a list of dictionaries (one for each dropdown in the task) giving the vote count for each key
- Return type:
dict
- panoptes_aggregation.reducers.dropdown_reducer.process_data(data)
Process a list of extracted dropdown answers into Counter objects
- Parameters:
data (list) – A list of extractions created by
panoptes_aggregation.extractors.dropdown_extractor.dropdown_extractor()
- Returns:
process_data – A list-of-lists of Counter objects. There is one element of the outer list for each classification made, and one element of the inner list for each dropdown list in the task.
- Return type:
list
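The two steps above can be sketched together: per-dropdown Counters are summed across classifications. Wrapping the result under a 'value' key is an assumption about the output format, not confirmed by the source:

```python
from collections import Counter

def dropdown_reducer_sketch(votes_list):
    """Sum per-dropdown Counters across classifications.

    votes_list holds one inner list per classification, one Counter
    per dropdown. A sketch, not the library implementation.
    """
    totals = [Counter() for _ in votes_list[0]]
    for classification in votes_list:
        for counter, votes in zip(totals, classification):
            counter.update(votes)
    # NOTE: the 'value' wrapper key is an illustrative assumption.
    return {'value': [dict(c) for c in totals]}
```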
TESS Column Reducer
This module provides functions to reduce the column task extracts for the TESS project.
Extracts are from panoptes_aggregation.extractors.shape_extractor
.
- panoptes_aggregation.reducers.tess_reducer_column.process_data(data, **kwargs_extra_data)
Process a list of extractions into lists of x and y sorted by tool
- Parameters:
data (list) – A list of extractions created by
panoptes_aggregation.extractors.shape_extractor.shape_extractor()
- Returns:
processed_data – A dictionary with two keys
data: An Nx2 numpy array containing the center and width of each column drawn
index: A list of length N indicating the extract index for each drawn column
- Return type:
dict
- panoptes_aggregation.reducers.tess_reducer_column.tess_reducer_column(data_by_tool, **kwargs)
Cluster TESS columns using DBSCAN
- Parameters:
data_by_tool (dict) – A dictionary returned by
process_data()
user_id (keyword, list) – A list containing the user IDs for each extract
relevant_reduction (keyword, list) – A list containing the TESS user reduction for each extract
panoptes_aggregation.running_reducers.tess_user_reducer.tess_user_reducer()
x (keyword, str) – Either “center” or “left” and indicates if the x value of the classification is the center or left side of the column
kwargs – See DBSCAN
- Returns:
reduction – A dictionary with the following keys
centers : A list with the center x position for all identified columns
widths : A list with the full width of all identified columns
counts : A list with the number of volunteers who identified each column
weighted_counts : A list with the weighted number of volunteers who identified each column
user_ids: A list of lists with the user_id for each volunteer who marked each column
max_weighted_counts: The largest likelihood of a transit for this subject
- Return type:
dict
TESS Gold Standard Reducer
This module provides functions to reduce the gold standard task extracts for the TESS project.
- panoptes_aggregation.reducers.tess_gold_standard_reducer.process_data(extracts)
Process the feedback extracts
- Parameters:
extracts (list) – A list of extracts from Caesar’s pluck field extractor
- Returns:
success – A list-of-lists, one list for each classification with booleans indicating the volunteer’s success at finding each gold standard transit in a subject.
- Return type:
list
- panoptes_aggregation.reducers.tess_gold_standard_reducer.tess_gold_standard_reducer(data)
Calculate the difficulty of a gold standard TESS subject
- Parameters:
data (list) – The results of
process_data()
- Returns:
output – A dictionary with one key difficulty that is a list with the fraction of volunteers who successfully found each gold standard transit in a subject.
- Return type:
dict
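The difficulty computation is a column-wise fraction over the success booleans; a minimal sketch of the documented reduction:

```python
def tess_difficulty_sketch(success):
    """Fraction of volunteers who found each gold standard transit.

    success holds one list of booleans per classification, one entry
    per transit. A sketch, not the library implementation.
    """
    n_users = len(success)
    difficulty = [
        sum(user[i] for user in success) / n_users
        for i in range(len(success[0]))
    ]
    return {'difficulty': difficulty}
```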
Utilities for optics_line_text_reducer
This module provides utilities used to reduce the polygon-text extractions
for panoptes_aggregation.reducers.optics_line_text_reducer
.
It assumes that all extracts are full lines of text in the document.
- panoptes_aggregation.reducers.optics_text_utils.cluster_of_one(X, data, user_ids, extract_index)
Create “clusters of one” out of the data passed in. Lines of text identified as noise are kept around as clusters of one so they can be displayed in the front-end to the next user.
- Parameters:
X (list) – An nx2 list with each row containing [index mapping to data, index mapping to user]
data (list) – A list containing dictionaries with the original data that X maps to, of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}.
user_ids (list) – A list of user_ids (The second column of X maps to this list)
extract_index (list) – A list of n values with the extract index for each of rows in X
- Returns:
clusters – A list with n clusters each containing only one classification
- Return type:
list
- panoptes_aggregation.reducers.optics_text_utils.get_min_samples(N)
Get the min_samples attribute based on the number of users who have transcribed the subject. These values were found based on example data from ASM.
- Parameters:
N (integer) – The number of users who have seen the subject
- Returns:
min_samples – The value to use for the min_samples keyword in OPTICS
- Return type:
integer
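A lookup of this kind might look as follows. The thresholds below are illustrative placeholders only; the real values were tuned on the ASM example data and are not reproduced here:

```python
def get_min_samples_sketch(N):
    """Illustrative min_samples lookup keyed on the number of users.

    The breakpoints here are hypothetical, not the ASM-tuned values.
    """
    if N <= 5:
        return 2
    elif N <= 10:
        return 3
    # scale with user count for large N, but never drop below 3
    return max(3, N // 4)
```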
- panoptes_aggregation.reducers.optics_text_utils.metric(a, b, data_in=[])
Calculate the distance between two drawn lines that have text associated with them. This distance is the sum of the Euclidean distance between the start points of the two lines, the Euclidean distance between their end points, and the Levenshtein distance between their text. The Levenshtein distance is computed after stripping text tags and consolidating whitespace.
To use this metric within the clustering code without having to precompute the full distance matrix, a and b are index mappings into the data contained in data_in. a and b also contain the user information that is used to help prevent self-clustering.
- Parameters:
a (list) – A two element list containing [index mapping to data, index mapping to user]
b (list) – A two element list containing [index mapping to data, index mapping to user]
data_in (list) – A list of dicts of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}. There is one element in this list for each classification made.
- Returns:
distance – The distance between a and b
- Return type:
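Under the definition above, the distance is the sum of two Euclidean endpoint distances plus a Levenshtein distance on the cleaned text. A minimal sketch with a classic dynamic-programming Levenshtein (the real package may use an optimized library, and the self-clustering guard is omitted here):

```python
import math
import re


def _clean(text):
    """Strip [tags] and consolidate whitespace before comparing text."""
    return " ".join(re.sub(r"\[.*?\]", " ", text).split())


def levenshtein(s1, s2):
    """Classic dynamic-programming edit distance."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            current.append(min(
                previous[j + 1] + 1,       # deletion
                current[j] + 1,            # insertion
                previous[j] + (c1 != c2),  # substitution
            ))
        previous = current
    return previous[-1]


def line_distance(a, b, data_in):
    """Sum of start-point, end-point, and text distances for two lines."""
    da, db = data_in[int(a[0])], data_in[int(b[0])]
    start = math.hypot(da["x"][0] - db["x"][0], da["y"][0] - db["y"][0])
    end = math.hypot(da["x"][1] - db["x"][1], da["y"][1] - db["y"][1])
    return start + end + levenshtein(_clean(da["text"][0]), _clean(db["text"][0]))
```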
- panoptes_aggregation.reducers.optics_text_utils.order_lines(frame_in, angle_eps=30, gutter_eps=150)
Place the identified lines within a single frame in reading order
- Parameters:
frame_in (list) – A list of identified transcribed lines (one frame from panoptes_aggregation.reducers.optics_line_text_reducer.optics_line_text_reducer)
angle_eps (float) – The DBSCAN eps value to use for the slope clustering
gutter_eps (float) – The DBSCAN eps value to use for the column clustering
- Returns:
frame_ordered – The identified transcribed lines in reading order. The slope_label and gutter_label values are added to each line to indicate what cluster it belongs to.
- Return type:
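A simplified version of reading order can be sketched as follows, ignoring the slope clustering and replacing DBSCAN with a naive one-dimensional threshold on the line start positions (gutter_eps is used as in the signature above; everything else here is illustrative):

```python
def order_lines_sketch(frame, gutter_eps=150):
    """Order transcribed lines left-to-right by column, then top-to-bottom.

    Each line is a dict with `clusters_x` and `clusters_y`; the first
    element of each is taken as the line's start point.
    """
    lines = sorted(frame, key=lambda line: line["clusters_x"][0])
    # naive column grouping: start a new column when the gap exceeds gutter_eps
    columns = []
    for line in lines:
        if columns and line["clusters_x"][0] - columns[-1][-1]["clusters_x"][0] <= gutter_eps:
            columns[-1].append(line)
        else:
            columns.append([line])
    # within each column, read top to bottom (image y increases downward)
    ordered = []
    for label, column in enumerate(columns):
        for line in sorted(column, key=lambda l: l["clusters_y"][0]):
            line["gutter_label"] = label
            ordered.append(line)
    return ordered
```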
- panoptes_aggregation.reducers.optics_text_utils.remove_user_duplication(labels_, core_distances_, users)
Make sure a user shows up in a cluster at most once. If a user does show up more than once in a cluster, keep the point with the smallest core distance; all others are assigned as noise (-1).
- Parameters:
labels_ (numpy.array) – A list containing the cluster labels for each data point
core_distances_ (numpy.array) – A list of core distances, one for each data point
users (numpy.array) – A list of indices that map to users, one for each data point
- Returns:
clean_labels_ – A list containing the new cluster labels.
- Return type:
numpy.array
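The deduplication step described above can be sketched without numpy as follows (a sketch of the behaviour, not the package's implementation):

```python
def remove_user_duplication_sketch(labels, core_distances, users):
    """Keep one point per user per cluster (smallest core distance);
    relabel the rest as noise (-1)."""
    best = {}  # (cluster label, user) -> index of the best point so far
    for idx, (label, dist, user) in enumerate(zip(labels, core_distances, users)):
        if label == -1:
            continue  # noise points stay noise
        key = (label, user)
        if key not in best or dist < core_distances[best[key]]:
            best[key] = idx
    keep = set(best.values())
    return [
        label if label == -1 or idx in keep else -1
        for idx, label in enumerate(labels)
    ]
```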
- panoptes_aggregation.reducers.optics_text_utils.strip_tags(s)
Remove square bracket tags from text and consolidate whitespace
- Parameters:
s (string) – The input string
- Returns:
clean_s – The cleaned string
- Return type:
string
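A minimal version of this cleanup using only the standard library, assuming tags look like [tag] or [/tag] (the exact tag grammar the package accepts may differ):

```python
import re


def strip_tags_sketch(s):
    """Remove square-bracket tags and consolidate whitespace."""
    no_tags = re.sub(r"\[.*?\]", " ", s)
    return " ".join(no_tags.split())
```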
Line Tool with Text Subtask Reducer using OPTICS
This module provides functions to reduce the polygon-text extractions from
panoptes_aggregation.extractors.poly_line_text_extractor
using the
density independent clustering algorithm OPTICS. It is assumed that all
extracts are full lines of text in the document.
- panoptes_aggregation.reducers.optics_line_text_reducer.optics_line_text_reducer(data_by_frame, **kwargs_optics)
Reduce the line-text extracts as a list of lines of text.
- Parameters:
data_by_frame (dict) – A dictionary returned by
process_data()
kwargs –
min_samples : The smallest number of transcribed lines needed to form a cluster. auto will set this value based on the number of volunteers who transcribed on a page within a subject.
xi : Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.
angle_eps : How close the angle of two lines need to be in order to be placed in the same angle cluster. Note: This will only change the order of the lines.
gutter_eps : How close the x position of the start of two lines need to be in order to be placed in the same column cluster. Note: This will only change the order of the lines.
min_line_length : The minimum length a transcribed line of text needs to be in order to be used in the reduction.
low_consensus_threshold : The minimum consensus score allowed to be considered “done”.
minimum_views : A value that is passed along to the front-end to set when lines should turn grey (has no effect on aggregation)
- Returns:
reduction – A dictionary with one key for each frame of the subject, each having a list as its value. Each item of the list represents one transcribed line of text and is a dictionary with these keys:
clusters_x : the x position of each identified word
clusters_y : the y position of each identified word
clusters_text : A list of lists containing the text at each cluster position. There is one list for each identified word, and each of those lists contains one item for each user who identified the cluster. If the user did not transcribe the word an empty string is used.
line_slope: The slope of the line of text in degrees
number_views : The number of users who transcribed the line of text
consensus_score : The average number of users whose text agreed for the line. Note: if consensus_score is the same as number_views, every user agreed with each other
user_ids: List of panoptes user ids in the same order as clusters_text
gold_standard: List of bools indicating if a transcription was made in the front-end’s gold standard mode
slope_label: integer indicating what slope cluster the line belongs to
gutter_label: integer indicating what gutter cluster (i.e. column) the line belongs to
low_consensus : True if the consensus_score is less than the threshold set by the low_consensus_threshold keyword
For the entire subject the following is also returned:
low_consensus_lines : The number of lines with low consensus
transcribed_lines : The total number of lines transcribed on the subject
Note: the image coordinate system has y increasing downward.
- Return type:
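Reading the consensus_score description above as "the mean, over word positions, of the count of the most common transcription", the score can be sketched from clusters_text like this (a sketch under that reading, not the package's exact computation):

```python
from collections import Counter


def consensus_score_sketch(clusters_text):
    """Average, over word positions, of the number of users who agreed
    on the most common transcription for that word."""
    if not clusters_text:
        return 0.0
    scores = [Counter(words).most_common(1)[0][1] for words in clusters_text]
    return sum(scores) / len(scores)
```

When every word position has full agreement, the score equals the number of views, matching the note above.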
- panoptes_aggregation.reducers.optics_line_text_reducer.process_data(data_list, min_line_length=0.0)
Process a list of extractions into a dictionary organized by frame
- Parameters:
data_list (list) – A list of extractions created by
panoptes_aggregation.extractors.poly_line_text_extractor.poly_line_text_extractor()
- Returns:
processed_data – A dictionary with one key for each frame of the subject. The value for each key is a dictionary with two keys X and data. X is a 2D array with each row mapping to the data held in data. The first column contains row indices and the second column is an index assigned to each user. data is a list of dictionaries of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}.
- Return type:
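The by-frame organization described above can be sketched as follows. The extract layout assumed here (frame keys mapping to parallel 'points' and 'text' lists) is illustrative and may not match the real extractor output exactly:

```python
def lines_by_frame_sketch(data_list, min_line_length=0.0):
    """Organize line-text extracts into per-frame X/data structures.

    Assumes each extract maps frame keys (e.g. 'frame0') to dicts with
    parallel 'points' ({'x': ..., 'y': ...}) and 'text' lists; this
    layout is an assumption for illustration.
    """
    processed = {}
    for user_index, extract in enumerate(data_list):
        for frame, value in extract.items():
            frame_data = processed.setdefault(frame, {"X": [], "data": []})
            for x, y, text in zip(value["points"]["x"], value["points"]["y"], value["text"]):
                length = ((x[1] - x[0]) ** 2 + (y[1] - y[0]) ** 2) ** 0.5
                if length < min_line_length:
                    continue  # drop lines shorter than the cutoff
                # first column: row index into data; second column: user index
                frame_data["X"].append([len(frame_data["data"]), user_index])
                frame_data["data"].append({
                    "x": x,
                    "y": y,
                    "text": [text],
                    "gold_standard": value.get("gold_standard", False),
                })
    return processed
```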
Text Tool Reducer
This module provides functions to reduce the panoptes text tool into an alignment table.
- panoptes_aggregation.reducers.text_reducer.process_data(data_list)
Flatten a list of extracts into a list of strings. Empty strings are not returned.
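The flattening step can be sketched like this (the 'text' key name assumed here is illustrative):

```python
def flatten_text_extracts_sketch(data_list):
    """Flatten text extracts into a list of non-empty strings.

    Assumes each extract is a dict with a 'text' key; that key name
    is an assumption for illustration.
    """
    return [d["text"] for d in data_list if d.get("text")]
```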
- panoptes_aggregation.reducers.text_reducer.text_reducer(data_in, **kwargs)
Reduce a list of text into an alignment table
- Parameters:
data_in (list) – A list of strings to be aligned
- Returns:
reduction – A dictionary with the following keys:
aligned_text: A list of lists containing the aligned text. There is one list for each identified word, and each of those lists contains one item for each user who entered text. If the user did not transcribe a word an empty string is used.
number_views: Number of volunteers who entered non-blank text
consensus_score: The average number of users whose text agreed. Note: if consensus_score is the same as number_views, every user agreed with each other
- Return type:
First N True Reducer
This module is designed to reduce boolean-valued extracts e.g.
panoptes_aggregation.extractors.all_tasks_empty_extractor
.
It returns True if and only if the first N extracts are True.
- panoptes_aggregation.reducers.first_n_true_reducer.first_n_true_reducer(data_list, n=0, **kwargs)
Reduce a list of boolean values to a single boolean value.
- Parameters:
data_list (list) – A list of dicts containing a “result” key which should correspond with a boolean value.
n (int) – The first n results in data_list must be True.
- Returns:
reduction – reduction[“result”] is True if the first n results in data_list are True. Otherwise False.
- Return type:
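The behaviour described above can be sketched in a few lines (a sketch; the function name is illustrative):

```python
def first_n_true_sketch(data_list, n=0):
    """True iff there are at least n extracts and the first n all have
    a truthy 'result' value."""
    if len(data_list) < n:
        return {"result": False}
    return {"result": all(d["result"] for d in data_list[:n])}
```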