This page details the crucial data transformation mechanisms within Navi, focusing on its `segdense` and `dr_transform` components. These components prepare incoming data for machine learning model inference, acting as a bridge between raw feature data and the structured inputs expected by models. For an overview of Navi's role, refer to [Navi - ML Model Serving Framework].
The `segdense` component plays a foundational role in Navi by parsing JSON schema files and generating a `FeatureMapper`. This mapper translates abstract feature identifiers into concrete locations within the model's input tensors.
`segdense` is responsible for loading and interpreting a JSON schema that describes the expected feature structure for a machine learning model. For one example schema, the module `segdense_transform_spec_home_recap_2022.rs` defines the Rust data structures (`Root`, `DensificationTransformSpec`, `InputFeature`, `ComplexFeatureTypeTransformSpec`) that represent this configuration.
The parsing process typically begins with a JSON string read from a schema file (e.g., `json/compact.json`, as seen in `navi/segdense/src/main.rs`). The `segdense::util::parse` function then deserializes this string into a `seg_dense::Root` struct (`navi/segdense/src/util.rs`).
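The general shape of such a schema is sketched below. This fragment is illustrative only: the field names are assumed from the struct names and fields referenced elsewhere on this page (`input_features_map`, `feature_id`, `feature_type`, `index`), not copied from the real `compact.json`.

```json
{
  "input_features_map": {
    "12345": {
      "feature_id": 12345,
      "feature_type": 2,
      "index": 7
    }
  }
}
```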
The core output of `segdense`'s parsing is a `FeatureMapper` (defined in `navi/segdense/src/mapper.rs`). This is essentially a hash map (`HashMap<i64, FeatureInfo>`) that provides a lookup mechanism:
- `feature_id`: a 64-bit hash of the feature's name, as used in `DataRecord`s.
- `FeatureInfo`: a struct containing two key pieces of information:
  - `tensor_index`: an `i8` indicating which input tensor (e.g., continuous, binary, embedding) the feature belongs to.
  - `index_within_tensor`: an `i64` specifying the exact position of the feature's value within its assigned tensor.

The `segdense::util::load_from_parsed_config` function (`navi/segdense/src/util.rs`) iterates through the parsed `seg_dense::Root` object's `input_features_map`. For each input feature, it calls `segdense::util::to_feature_info` to generate the `FeatureInfo`, then populates the `FeatureMapper` via `fm.set(feature_id, info)`.
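The lookup structure can be sketched as follows. This is a simplified, self-contained sketch of the mapper described above, not the exact code from `navi/segdense/src/mapper.rs`:

```rust
use std::collections::HashMap;

// Simplified sketch of the FeatureInfo/FeatureMapper pair described above.
#[derive(Debug, Clone, PartialEq)]
struct FeatureInfo {
    tensor_index: i8,         // which input tensor the feature lands in
    index_within_tensor: i64, // column within that tensor
}

#[derive(Default)]
struct FeatureMapper {
    map: HashMap<i64, FeatureInfo>, // keyed by the 64-bit feature-name hash
}

impl FeatureMapper {
    fn set(&mut self, feature_id: i64, info: FeatureInfo) {
        self.map.insert(feature_id, info);
    }
    fn get(&self, feature_id: &i64) -> Option<&FeatureInfo> {
        self.map.get(feature_id)
    }
}

fn main() {
    let mut fm = FeatureMapper::default();
    // A hypothetical continuous feature placed at column 42 of tensor 0.
    fm.set(123, FeatureInfo { tensor_index: 0, index_within_tensor: 42 });
    let info = fm.get(&123).unwrap();
    assert_eq!(info.tensor_index, 0);
    assert_eq!(info.index_within_tensor, 42);
    assert!(fm.get(&999).is_none()); // unmapped features simply miss
    println!("ok");
}
```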
The `segdense::util::to_feature_info` function (`navi/segdense/src/util.rs`) contains model-specific logic for mapping features. This function makes several assumptions:

- Features flagged `maybe_exclude` are explicitly ignored (the function returns `None`).
- Certain `feature_id` values are hardcoded to map to particular embedding tensors. For instance:
  - `-2550691008059411095` maps to `tensor_index = 3` (likely `user.timelines.twhin_user_follow_embeddings`).
  - `5390650078733277231` maps to `tensor_index = 4` (likely `user.timelines.twhin_user_engagement_embeddings`).
  - `3223956748566688423` maps to `tensor_index = 5` (likely `original_author.timelines.twhin_author_follow_embeddings`).
- Otherwise, `tensor_index` is derived from `feature_type` (as defined in `src/thrift/com/twitter/ml/api/data.thrift`):
  - `1` (BINARY) maps to `tensor_index = 1`.
  - `2` (CONTINUOUS) maps to `tensor_index = 0`.
  - `3` (DISCRETE) maps to `tensor_index = 2`.

These mappings assume a fixed order of input tensors (`[continuous, binary, discrete, user_embedding, user_eng_embedding, author_embedding]`) as expected by the downstream ML model. If `input_feature.index` is negative, or `tensor_idx` remains `-1` after these checks, the feature is also excluded.
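The tensor-index selection rules above can be condensed into a single match. This is a hedged sketch of the mapping logic as described, not the exact control flow of `to_feature_info` (which also handles `maybe_exclude` and negative indices):

```rust
// Sketch of the tensor_index selection rules described above.
// Feature-type codes follow data.thrift: 1 = BINARY, 2 = CONTINUOUS, 3 = DISCRETE.
fn tensor_index_for(feature_id: i64, feature_type: i64) -> i8 {
    match feature_id {
        // Hardcoded embedding features:
        -2550691008059411095 => 3, // twhin_user_follow_embeddings
        5390650078733277231 => 4,  // twhin_user_engagement_embeddings
        3223956748566688423 => 5,  // twhin_author_follow_embeddings
        _ => match feature_type {
            1 => 1,  // BINARY     -> binary tensor
            2 => 0,  // CONTINUOUS -> continuous tensor
            3 => 2,  // DISCRETE   -> discrete tensor
            _ => -1, // unknown: the feature is excluded
        },
    }
}

fn main() {
    assert_eq!(tensor_index_for(5390650078733277231, 0), 4);
    assert_eq!(tensor_index_for(42, 2), 0);  // ordinary continuous feature
    assert_eq!(tensor_index_for(42, 9), -1); // unknown type -> excluded
    println!("ok");
}
```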
The `dr_transform` component is responsible for converting raw incoming data, typically `BatchPredictionRequest` Thrift objects, into the `InputTensor` format required by the ONNX runtime for model inference.
`dr_transform` receives data as `Vec<Vec<u8>>`, where each inner `Vec<u8>` is a serialized `BatchPredictionRequest` Thrift object. These Thrift objects contain `DataRecord`s, which encapsulate various feature types (continuous, binary, discrete, string, blob) and dense/sparse tensors (including embeddings). The conversion produces a `(Vec<InputTensor>, Vec<usize>)` pair:

- `Vec<InputTensor>`: a collection of tensors directly consumable by the ONNX runtime; these can be `InputTensor::FloatTensor` or `InputTensor::Int64Tensor`.
- `Vec<usize>`: a list of batch end indices, used to reconstruct the original per-request batching after model inference.

The `BatchPredictionRequestToTorchTensorConverter` (defined in `navi/dr_transform/src/converter.rs`) is the primary implementation of the `Converter` trait, orchestrating this transformation.
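The role of the batch-end indices can be shown with a small sketch. The `InputTensor` variant names come from the page above; the splitting helper and its signature are assumptions for illustration, showing how a flat per-row output is sliced back into per-request groups after a fused inference pass:

```rust
// Simplified stand-in for the converter's output tensor type.
#[derive(Debug)]
enum InputTensor {
    FloatTensor(Vec<f32>),
    Int64Tensor(Vec<i64>),
}

// Hypothetical helper: split a flat per-row result back into per-request
// chunks using the batch-end indices returned alongside the tensors.
fn split_by_batch_ends(rows: &[f32], batch_ends: &[usize]) -> Vec<Vec<f32>> {
    let mut out = Vec::new();
    let mut start = 0;
    for &end in batch_ends {
        out.push(rows[start..end].to_vec());
        start = end;
    }
    out
}

fn main() {
    // Two requests fused into one batch: rows 0..2 and rows 2..5.
    let scores = [0.1f32, 0.2, 0.3, 0.4, 0.5];
    let per_request = split_by_batch_ends(&scores, &[2, 5]);
    assert_eq!(per_request, vec![vec![0.1, 0.2], vec![0.3, 0.4, 0.5]]);
    let _tensor = InputTensor::FloatTensor(scores.to_vec());
    println!("ok");
}
```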
Its `new` method initializes the converter by:

- Parsing `all_config.json` (via `all_config::parse`) for overall model configuration.
- Loading `segdense_transform_spec_home_recap_2022.json` (reusing `segdense::util::load_config`) to obtain the `FeatureMapper` for feature-to-tensor mapping.
- Extracting the `feature_id`s of key embeddings (e.g., user_embedding, user_eng_embedding, author_embedding) from the `all_config`.

The core conversion logic resides in the `convert` method, which:

- Deserializes the incoming `Vec<Vec<u8>>` into a `Vec<BatchPredictionRequest>` using `parse_batch_prediction_request`. This helper uses `TBinaryInputProtocol` to read each Thrift object from bytes.
- Consolidates the `DataRecord`s from all `BatchPredictionRequest`s into a single logical batch, while tracking the original request boundaries in `batch_ends`.
- Builds the `InputTensor`s:
  - `get_continuous`: builds a large `Array2<f32>` (e.g., of dimensions `[rows, 5293]`). It iterates through both `common_features.continuous_features` (features common to all `DataRecord`s in a `BatchPredictionRequest`) and `individual_features_list.continuous_features`, using the `FeatureMapper` (populated by `segdense`) to determine the `index_within_tensor` for each feature and placing its `f32` value accordingly.
  - `get_binary`: similarly processes binary features and populates an `Array2<i64>` (e.g., `[rows, 149]`), setting the value to `1` if the feature is present.
  - `get_embedding_tensors`: extracts dense embeddings. It expects embedding features to be present as `GeneralTensor::FloatTensor` within the `tensors` field of `common_features` or individual `DataRecord`s, and consolidates them into an `Array2<f32>` (e.g., `[rows, 200]` for typical embeddings).
  - `get_user_embedding`, `get_eng_embedding`, `get_author_embedding`: internally call `get_embedding_tensors` with their respective `feature_id`s to extract the user, user engagement, and author embeddings.

Finally, `convert` returns a `Vec<InputTensor>` containing the processed continuous, binary, and embedding features, ready for the ONNX runtime. `InputTensor::FloatTensor` is used for continuous and embedding data, and `InputTensor::Int64Tensor` for binary data. This transformation ensures that complex, serialized data is efficiently prepared for high-performance ML inference.
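The dense scatter performed by `get_continuous` can be sketched with a plain row-major buffer in place of ndarray's `Array2<f32>`. The helper name, the `(row, index_within_tensor, value)` triples, and the small tensor width here are illustrative assumptions, not the real code:

```rust
// Sketch (assumed shapes and names) of the dense scatter done by get_continuous:
// each feature value lands at [row, index_within_tensor] in a [rows, width] buffer.
fn fill_continuous(rows: usize, width: usize, entries: &[(usize, i64, f32)]) -> Vec<f32> {
    let mut buf = vec![0.0f32; rows * width]; // unset features default to 0.0
    for &(row, col, value) in entries {
        // Features mapped outside the tensor (including negative indices,
        // which wrap to huge values under `as usize`) are skipped.
        if row < rows && (col as usize) < width {
            buf[row * width + col as usize] = value;
        }
    }
    buf
}

fn main() {
    // Two DataRecords scattered into a 4-column continuous tensor.
    let buf = fill_continuous(2, 4, &[(0, 1, 3.5), (1, 3, -2.0)]);
    assert_eq!(buf, vec![0.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, -2.0]);
    println!("ok");
}
```

The same pattern applies to `get_binary`, with an integer buffer and `1` written wherever a feature is present.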