This page details the crucial data transformation mechanisms within Navi, focusing on its `segdense` and `dr_transform` components. These components prepare incoming data for machine learning model inference, acting as a bridge between raw feature data and the structured inputs expected by models. For an overview of Navi's role, refer to [Navi - ML Model Serving Framework].
The `segdense` component plays a foundational role in Navi by parsing JSON schema files and generating a `FeatureMapper`. This mapper translates abstract feature identifiers into concrete locations within the model's input tensors.
`segdense` is responsible for loading and interpreting a JSON schema that describes the expected feature structure for a machine learning model. For one example schema, the module `segdense_transform_spec_home_recap_2022.rs` defines the Rust data structures (`Root`, `DensificationTransformSpec`, `InputFeature`, `ComplexFeatureTypeTransformSpec`) that represent this configuration.
The parsing process typically begins with a JSON string read from a schema file (e.g., `json/compact.json`, as seen in `navi/segdense/src/main.rs`). The `segdense::util::parse` function then deserializes this string into a `seg_dense::Root` struct (`navi/segdense/src/util.rs`).
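The general shape of such a schema is sketched below. This fragment is illustrative only: the field names are assumed from the struct names and fields referenced elsewhere on this page (`input_features_map`, `feature_id`, `feature_type`, `index`), not copied from the real `compact.json`.

```json
{
  "input_features_map": {
    "12345": {
      "feature_id": 12345,
      "feature_type": 2,
      "index": 7
    }
  }
}
```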
The core output of `segdense`'s parsing is a `FeatureMapper` (defined in `navi/segdense/src/mapper.rs`). This is essentially a hash map (`HashMap<i64, FeatureInfo>`) that provides a lookup mechanism:
- `feature_id`: a 64-bit hash of the feature's name, as used in `DataRecord`s.
- `FeatureInfo`: a struct containing two key pieces of information:
  - `tensor_index`: an `i8` indicating which input tensor (e.g., continuous, binary, embedding) the feature belongs to.
  - `index_within_tensor`: an `i64` specifying the exact position of the feature's value within its assigned tensor.

The `segdense::util::load_from_parsed_config` function (`navi/segdense/src/util.rs`) iterates through the parsed `seg_dense::Root` object's `input_features_map`. For each input feature, it calls `segdense::util::to_feature_info` to generate the `FeatureInfo`, then populates the `FeatureMapper` via `fm.set(feature_id, info)`.
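The lookup structure can be sketched as follows. This is a simplified, self-contained sketch of the mapper described above, not the exact code from `navi/segdense/src/mapper.rs`:

```rust
use std::collections::HashMap;

// Simplified sketch of the FeatureInfo/FeatureMapper pair described above.
#[derive(Debug, Clone, PartialEq)]
struct FeatureInfo {
    tensor_index: i8,         // which input tensor the feature lands in
    index_within_tensor: i64, // column within that tensor
}

#[derive(Default)]
struct FeatureMapper {
    map: HashMap<i64, FeatureInfo>, // keyed by the 64-bit feature-name hash
}

impl FeatureMapper {
    fn set(&mut self, feature_id: i64, info: FeatureInfo) {
        self.map.insert(feature_id, info);
    }
    fn get(&self, feature_id: &i64) -> Option<&FeatureInfo> {
        self.map.get(feature_id)
    }
}

fn main() {
    let mut fm = FeatureMapper::default();
    // A hypothetical continuous feature placed at column 42 of tensor 0.
    fm.set(123, FeatureInfo { tensor_index: 0, index_within_tensor: 42 });
    let info = fm.get(&123).unwrap();
    assert_eq!(info.tensor_index, 0);
    assert_eq!(info.index_within_tensor, 42);
    assert!(fm.get(&999).is_none()); // unmapped features simply miss
    println!("ok");
}
```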
The `segdense::util::to_feature_info` function (`navi/segdense/src/util.rs`) contains model-specific logic for mapping features. This function makes several assumptions:

- Features flagged `maybe_exclude` are explicitly ignored (the function returns `None`).
- Certain `feature_id` values are hardcoded to map to particular embedding tensors. For instance:
  - `-2550691008059411095` maps to `tensor_index = 3` (likely `user.timelines.twhin_user_follow_embeddings`).
  - `5390650078733277231` maps to `tensor_index = 4` (likely `user.timelines.twhin_user_engagement_embeddings`).
  - `3223956748566688423` maps to `tensor_index = 5` (likely `original_author.timelines.twhin_author_follow_embeddings`).
- Otherwise, `tensor_index` is derived from `feature_type` (as defined in `src/thrift/com/twitter/ml/api/data.thrift`):
  - `1` (BINARY) maps to `tensor_index = 1`.
  - `2` (CONTINUOUS) maps to `tensor_index = 0`.
  - `3` (DISCRETE) maps to `tensor_index = 2`.

These mappings assume a fixed order of input tensors (`[continuous, binary, discrete, user_embedding, user_eng_embedding, author_embedding]`) as expected by the downstream ML model. If `input_feature.index` is negative, or `tensor_idx` remains `-1` after these checks, the feature is also excluded.
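The tensor-index selection rules above can be condensed into a single match. This is a hedged sketch of the mapping logic as described, not the exact control flow of `to_feature_info` (which also handles `maybe_exclude` and negative indices):

```rust
// Sketch of the tensor_index selection rules described above.
// Feature-type codes follow data.thrift: 1 = BINARY, 2 = CONTINUOUS, 3 = DISCRETE.
fn tensor_index_for(feature_id: i64, feature_type: i64) -> i8 {
    match feature_id {
        // Hardcoded embedding features:
        -2550691008059411095 => 3, // twhin_user_follow_embeddings
        5390650078733277231 => 4,  // twhin_user_engagement_embeddings
        3223956748566688423 => 5,  // twhin_author_follow_embeddings
        _ => match feature_type {
            1 => 1,  // BINARY     -> binary tensor
            2 => 0,  // CONTINUOUS -> continuous tensor
            3 => 2,  // DISCRETE   -> discrete tensor
            _ => -1, // unknown: the feature is excluded
        },
    }
}

fn main() {
    assert_eq!(tensor_index_for(5390650078733277231, 0), 4);
    assert_eq!(tensor_index_for(42, 2), 0);  // ordinary continuous feature
    assert_eq!(tensor_index_for(42, 9), -1); // unknown type -> excluded
    println!("ok");
}
```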
The `dr_transform` component is responsible for converting raw incoming data, typically `BatchPredictionRequest` Thrift objects, into the `InputTensor` format required by the ONNX runtime for model inference.
`dr_transform` receives data as `Vec<Vec<u8>>`, where each inner `Vec<u8>` is a serialized `BatchPredictionRequest` Thrift object. These Thrift objects contain `DataRecord`s, which encapsulate various feature types (continuous, binary, discrete, string, blob) and dense/sparse tensors (including embeddings). The conversion produces a `(Vec<InputTensor>, Vec<usize>)` pair:

- `Vec<InputTensor>`: a collection of tensors directly consumable by the ONNX runtime; these can be `InputTensor::FloatTensor` or `InputTensor::Int64Tensor`.
- `Vec<usize>`: a list of batch end indices, used to reconstruct the original per-request batching after model inference.

The `BatchPredictionRequestToTorchTensorConverter` (defined in `navi/dr_transform/src/converter.rs`) is the primary implementation of the `Converter` trait, orchestrating this transformation.
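The role of the batch-end indices can be shown with a small sketch. The `InputTensor` variant names come from the page above; the splitting helper and its signature are assumptions for illustration, showing how a flat per-row output is sliced back into per-request groups after a fused inference pass:

```rust
// Simplified stand-in for the converter's output tensor type.
#[derive(Debug)]
enum InputTensor {
    FloatTensor(Vec<f32>),
    Int64Tensor(Vec<i64>),
}

// Hypothetical helper: split a flat per-row result back into per-request
// chunks using the batch-end indices returned alongside the tensors.
fn split_by_batch_ends(rows: &[f32], batch_ends: &[usize]) -> Vec<Vec<f32>> {
    let mut out = Vec::new();
    let mut start = 0;
    for &end in batch_ends {
        out.push(rows[start..end].to_vec());
        start = end;
    }
    out
}

fn main() {
    // Two requests fused into one batch: rows 0..2 and rows 2..5.
    let scores = [0.1f32, 0.2, 0.3, 0.4, 0.5];
    let per_request = split_by_batch_ends(&scores, &[2, 5]);
    assert_eq!(per_request, vec![vec![0.1, 0.2], vec![0.3, 0.4, 0.5]]);
    let _tensor = InputTensor::FloatTensor(scores.to_vec());
    println!("ok");
}
```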
Its `new` method initializes the converter by:

- Parsing `all_config.json` (via `all_config::parse`) for overall model configuration.
- Loading `segdense_transform_spec_home_recap_2022.json` (reusing `segdense::util::load_config`) to obtain the `FeatureMapper` for feature-to-tensor mapping.
- Extracting the `feature_id`s of key embeddings (e.g., user_embedding, user_eng_embedding, author_embedding) from the `all_config`.

The core conversion logic resides in the `convert` method, which:

- Deserializes the incoming `Vec<Vec<u8>>` into a `Vec<BatchPredictionRequest>` using `parse_batch_prediction_request`. This helper uses `TBinaryInputProtocol` to read each Thrift object from bytes.
- Consolidates the `DataRecord`s from all `BatchPredictionRequest`s into a single logical batch, while tracking the original request boundaries in `batch_ends`.
- Builds the `InputTensor`s:
  - `get_continuous`: builds a large `Array2<f32>` (e.g., of dimensions `[rows, 5293]`). It iterates through both `common_features.continuous_features` (features common to all `DataRecord`s in a `BatchPredictionRequest`) and `individual_features_list.continuous_features`, using the `FeatureMapper` (populated by `segdense`) to determine the `index_within_tensor` for each feature and placing its `f32` value accordingly.
  - `get_binary`: similarly processes binary features and populates an `Array2<i64>` (e.g., `[rows, 149]`), setting the value to `1` if the feature is present.
  - `get_embedding_tensors`: extracts dense embeddings. It expects embedding features to be present as `GeneralTensor::FloatTensor` within the `tensors` field of `common_features` or individual `DataRecord`s, and consolidates them into an `Array2<f32>` (e.g., `[rows, 200]` for typical embeddings).
  - `get_user_embedding`, `get_eng_embedding`, `get_author_embedding`: internally call `get_embedding_tensors` with their respective `feature_id`s to extract the user, user engagement, and author embeddings.

Finally, `convert` returns a `Vec<InputTensor>` containing the processed continuous, binary, and embedding features, ready for the ONNX runtime. `InputTensor::FloatTensor` is used for continuous and embedding data, and `InputTensor::Int64Tensor` for binary data. This transformation ensures that complex, serialized data is efficiently prepared for high-performance ML inference.
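The dense scatter performed by `get_continuous` can be sketched with a plain row-major buffer in place of ndarray's `Array2<f32>`. The helper name, the `(row, index_within_tensor, value)` triples, and the small tensor width here are illustrative assumptions, not the real code:

```rust
// Sketch (assumed shapes and names) of the dense scatter done by get_continuous:
// each feature value lands at [row, index_within_tensor] in a [rows, width] buffer.
fn fill_continuous(rows: usize, width: usize, entries: &[(usize, i64, f32)]) -> Vec<f32> {
    let mut buf = vec![0.0f32; rows * width]; // unset features default to 0.0
    for &(row, col, value) in entries {
        // Features mapped outside the tensor (including negative indices,
        // which wrap to huge values under `as usize`) are skipped.
        if row < rows && (col as usize) < width {
            buf[row * width + col as usize] = value;
        }
    }
    buf
}

fn main() {
    // Two DataRecords scattered into a 4-column continuous tensor.
    let buf = fill_continuous(2, 4, &[(0, 1, 3.5), (1, 3, -2.0)]);
    assert_eq!(buf, vec![0.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, -2.0]);
    println!("ok");
}
```

The same pattern applies to `get_binary`, with an integer buffer and `1` written wherever a feature is present.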