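Feature Engineering

Learning objectives

Learn feature preprocessing methods for time series data

Learn to use Tsfresh (TimeSeries Fresh), a tool for time series feature processing

Data preprocessing

Convert the time series data format and add a time-step feature, time

Feature engineering

Construct time series features, select features, and process time series features with tsfresh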

# Import libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
# Read the data
data_train = pd.read_csv("train.csv")
data_test = pd.read_csv("testA.csv")
print(data_train.shape)
print(data_test.shape)
(100000, 3)
(20000, 2)
data_train.head()
   id                                  heartbeat_signals  label
0   0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
1   1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
2   2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
3   3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
4   4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
data_test.head()
       id                                  heartbeat_signals
0  100000  0.9915713654170097,1.0,0.6318163407681274,0.13...
1  100001  0.6075533139615096,0.5417083883163654,0.340694...
2  100002  0.9752726292239277,0.6710965234906665,0.686758...
3  100003  0.9956348033996116,0.9170249621481004,0.521096...
4  100004  1.0,0.8879490481178918,0.745564725322326,0.531...

Data preprocessing

# Reshape the heartbeat signals from one comma-separated string per row into long format
# (one row per sample point), and add a time-step column, time, for each signal
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",",expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"},inplace=True)
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
train_heartbeat_df
       time  heartbeat_signals
0         0           0.991230
0         1           0.943533
0         2           0.764677
0         3           0.618571
0         4           0.379632
...     ...                ...
99999   200           0.000000
99999   201           0.000000
99999   202           0.000000
99999   203           0.000000
99999   204           0.000000

20500000 rows × 2 columns
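
To make the reshaping step easier to follow, here is a minimal sketch of the same split/stack sequence on a tiny made-up frame. The name toy and its values are assumptions for illustration only; the calls mirror the ones used on data_train.

```python
import pandas as pd

# Toy stand-in for data_train: two signals with three sample points each (made-up values).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.7,0.0"],
})

# split(expand=True) yields one column per time step; stack() moves those columns
# into a second index level, giving one row per (signal, time step) pair.
long_df = toy["heartbeat_signals"].str.split(",", expand=True).stack()
long_df = long_df.reset_index()            # columns: level_0, level_1, 0
long_df = long_df.set_index("level_0")     # level_0 is the row position of the signal
long_df.index.name = None
long_df = long_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"})
long_df["heartbeat_signals"] = long_df["heartbeat_signals"].astype(float)

print(long_df)
```

Each input row becomes as many output rows as it has sample points, which is why 100000 signals of 205 points each give 20500000 rows.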

# Join the reshaped heartbeat signals back onto the training data, and store the label column separately
data_train_label = data_train["label"]
data_train = data_train.drop("label",axis=1)
data_train = data_train.drop("heartbeat_signals",axis=1)
data_train = data_train.join(train_heartbeat_df)

data_train
          id  time  heartbeat_signals
0          0     0           0.991230
0          0     1           0.943533
0          0     2           0.764677
0          0     3           0.618571
0          0     4           0.379632
...      ...   ...                ...
99999  99999   200           0.000000
99999  99999   201           0.000000
99999  99999   202           0.000000
99999  99999   203           0.000000
99999  99999   204           0.000000

20500000 rows × 3 columns
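
The join works because join() aligns on the index: the long-format frame repeats each signal's row position once per time step, so the id of that row is repeated alongside it. A small sketch with hypothetical toy frames (meta and long_df are made-up names and values) shows the effect, plus a quick check that every id has the same number of time steps.

```python
import pandas as pd

# Hypothetical toy data: two signals, already reshaped to long format.
meta = pd.DataFrame({"id": [10, 11]})                 # index 0, 1
long_df = pd.DataFrame(
    {"time": [0, 1, 0, 1], "heartbeat_signals": [0.9, 0.5, 1.0, 0.7]},
    index=[0, 0, 1, 1],                               # row position of the parent signal
)

# join() aligns on the index, so each row of meta is repeated once per time step.
joined = meta.join(long_df)
print(joined)

# Sanity check: every id should have the same number of time steps.
steps_per_id = joined.groupby("id").size()
print(steps_per_id)
```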

data_train[data_train["id"]==1]
    id  time  heartbeat_signals
1    1     0           0.971482
1    1     1           0.928969
1    1     2           0.572933
1    1     3           0.178457
1    1     4           0.122962
..  ..   ...                ...
1    1   200           0.000000
1    1   201           0.000000
1    1   202           0.000000
1    1   203           0.000000
1    1   204           0.000000

205 rows × 3 columns

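Time series feature processing with tsfresh

Feature extraction: tsfresh automatically computes a large number of features from time series data.

from tsfresh import extract_features
# Extract features
train_features = extract_features(data_train,column_id='id',column_sort='time')
train_features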
Feature Extraction: 100%|██████████| 40/40 [1:12:34<00:00, 108.85s/it]
Output (the first and last 10 of the 787 columns, shown transposed: one row per feature, one column per sample id):

feature                                                                         0          1          2          3          4        ...      99995      99996      99997      99998      99999
heartbeat_signals__variance_larger_than_standard_deviation                    0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__has_duplicate_max                                          0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__has_duplicate_min                                          1.0        1.0        1.0        1.0        1.0        ...        1.0        1.0        1.0        1.0        1.0
heartbeat_signals__has_duplicate                                              1.0        1.0        1.0        1.0        1.0        ...        1.0        1.0        1.0        1.0        1.0
heartbeat_signals__sum_values                                           38.927945  19.445634  21.192974  42.113066  69.756786        ...  63.323449  69.657534  40.897057  42.333303  53.290117
heartbeat_signals__abs_energy                                           18.216197   7.705092   9.140423  15.757623  51.229616        ...  28.742238  31.866323  16.412857  14.281281  21.637471
heartbeat_signals__mean_abs_change                                       0.019894   0.019952   0.009863   0.018743   0.014514        ...   0.023588   0.017373   0.019470   0.017032   0.021870
heartbeat_signals__mean_change                                          -0.004859  -0.004762  -0.004902  -0.004783   0.000000        ...  -0.004902  -0.004543  -0.004538  -0.004902  -0.004539
heartbeat_signals__mean_second_derivative_central                        0.000117   0.000105   0.000101   0.000103  -0.000137        ...   0.000794   0.000051   0.000834   0.000013   0.000023
heartbeat_signals__median                                                0.125531   0.030481   0.000000   0.241397   0.000000        ...   0.388402   0.421138   0.213306   0.264974   0.320124
...                                                                           ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
heartbeat_signals__permutation_entropy__dimension_5__tau_1               2.184420   2.710933   1.263370   2.986728   1.914511        ...   2.873602   3.085504   2.601062   3.236950   2.949266
heartbeat_signals__permutation_entropy__dimension_6__tau_1               2.500658   3.065802   1.406001   3.534354   2.165627        ...   3.391830   3.728881   2.996962   3.793512   3.462549
heartbeat_signals__permutation_entropy__dimension_7__tau_1               2.722686   3.224835   1.509478   3.854177   2.323993        ...   3.679969   4.095457   3.293562   4.018302   3.688612
heartbeat_signals__query_similarity_count__query_None__threshold_0.0          NaN        NaN        NaN        NaN        NaN        ...        NaN        NaN        NaN        NaN        NaN
heartbeat_signals__matrix_profile__feature_"min"__threshold_0.98         6.445546   3.209140   3.054539   3.010557   9.181236        ...   2.436377   1.415410   5.748652   2.346822   1.959139
heartbeat_signals__matrix_profile__feature_"max"__threshold_0.98        12.165525  12.649111   8.246211   9.797959  13.429784        ...   9.591663   7.483315  12.165525   8.246211   9.380832
heartbeat_signals__matrix_profile__feature_"mean"__threshold_0.98       10.246524   9.031069   7.370478   6.331360   9.959913        ...   5.635231   2.893592   8.524637   4.951374   4.573691
heartbeat_signals__matrix_profile__feature_"median"__threshold_0.98     10.746992   9.437545   8.246211   6.406440   9.516290        ...   6.366205   2.684349   7.983410   4.727535   3.908621
heartbeat_signals__matrix_profile__feature_"25"__threshold_0.98          8.388625   6.723180   5.966122   5.266743   9.286013        ...   3.596982   2.049241   7.062217   4.069786   3.094614
heartbeat_signals__matrix_profile__feature_"75"__threshold_0.98         11.484910  12.094899   8.246211   7.091706  10.270925        ...   7.033638   3.334109  10.081756   5.615282   5.916164

100000 rows × 787 columns
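
The column names follow the pattern heartbeat_signals__<feature name>. As a rough illustration of what a few of the simplest features measure, the sketch below recomputes sum_values, abs_energy, mean_abs_change and mean_change with NumPy. The helper basic_features and the toy signal are made up for illustration; the formulas follow the definitions in the tsfresh documentation as understood here, not its source code.

```python
import numpy as np

def basic_features(x):
    """Recompute four simple features with NumPy (illustrative helper, not part of tsfresh)."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)                                   # x[i+1] - x[i]
    return {
        "sum_values": x.sum(),                           # sum of all sample points
        "abs_energy": np.sum(x ** 2),                    # sum of squared sample points
        "mean_abs_change": np.mean(np.abs(diffs)),       # mean absolute step between neighbours
        "mean_change": (x[-1] - x[0]) / (len(x) - 1),    # mean signed step, i.e. end minus start over n-1
    }

feats = basic_features([1.0, 0.5, 0.75, 0.0])           # made-up toy signal
print(feats)

# Cross-check against the output: signal 1 starts at 0.971482 and ends at 0.000000
# over 205 sample points, so its mean_change should be about -0.004762.
mean_change_id1 = round((0.0 - 0.971482) / 204, 6)
print(mean_change_id1)
```

The last lines reproduce the mean_change value reported for id 1, which suggests the definition used in the sketch matches what tsfresh computed.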

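Feature selection: train_features contains 787 common time series features of heartbeat_signals. Some of these values may be NaN; replace them with impute as follows:

from tsfresh.utilities.dataframe_functions import impute
# Replace NaN values in the extracted features (impute modifies train_features in place)
impute(train_features)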
Output (the first and last 10 of the 787 columns, shown transposed: one row per feature, one column per sample id):

feature                                                                         0          1          2          3          4        ...      99995      99996      99997      99998      99999
heartbeat_signals__variance_larger_than_standard_deviation                    0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__has_duplicate_max                                          0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__has_duplicate_min                                          1.0        1.0        1.0        1.0        1.0        ...        1.0        1.0        1.0        1.0        1.0
heartbeat_signals__has_duplicate                                              1.0        1.0        1.0        1.0        1.0        ...        1.0        1.0        1.0        1.0        1.0
heartbeat_signals__sum_values                                           38.927945  19.445634  21.192974  42.113066  69.756786        ...  63.323449  69.657534  40.897057  42.333303  53.290117
heartbeat_signals__abs_energy                                           18.216197   7.705092   9.140423  15.757623  51.229616        ...  28.742238  31.866323  16.412857  14.281281  21.637471
heartbeat_signals__mean_abs_change                                       0.019894   0.019952   0.009863   0.018743   0.014514        ...   0.023588   0.017373   0.019470   0.017032   0.021870
heartbeat_signals__mean_change                                          -0.004859  -0.004762  -0.004902  -0.004783   0.000000        ...  -0.004902  -0.004543  -0.004538  -0.004902  -0.004539
heartbeat_signals__mean_second_derivative_central                        0.000117   0.000105   0.000101   0.000103  -0.000137        ...   0.000794   0.000051   0.000834   0.000013   0.000023
heartbeat_signals__median                                                0.125531   0.030481   0.000000   0.241397   0.000000        ...   0.388402   0.421138   0.213306   0.264974   0.320124
...                                                                           ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
heartbeat_signals__permutation_entropy__dimension_5__tau_1               2.184420   2.710933   1.263370   2.986728   1.914511        ...   2.873602   3.085504   2.601062   3.236950   2.949266
heartbeat_signals__permutation_entropy__dimension_6__tau_1               2.500658   3.065802   1.406001   3.534354   2.165627        ...   3.391830   3.728881   2.996962   3.793512   3.462549
heartbeat_signals__permutation_entropy__dimension_7__tau_1               2.722686   3.224835   1.509478   3.854177   2.323993        ...   3.679969   4.095457   3.293562   4.018302   3.688612
heartbeat_signals__query_similarity_count__query_None__threshold_0.0          0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__matrix_profile__feature_"min"__threshold_0.98         6.445546   3.209140   3.054539   3.010557   9.181236        ...   2.436377   1.415410   5.748652   2.346822   1.959139
heartbeat_signals__matrix_profile__feature_"max"__threshold_0.98        12.165525  12.649111   8.246211   9.797959  13.429784        ...   9.591663   7.483315  12.165525   8.246211   9.380832
heartbeat_signals__matrix_profile__feature_"mean"__threshold_0.98       10.246524   9.031069   7.370478   6.331360   9.959913        ...   5.635231   2.893592   8.524637   4.951374   4.573691
heartbeat_signals__matrix_profile__feature_"median"__threshold_0.98     10.746992   9.437545   8.246211   6.406440   9.516290        ...   6.366205   2.684349   7.983410   4.727535   3.908621
heartbeat_signals__matrix_profile__feature_"25"__threshold_0.98          8.388625   6.723180   5.966122   5.266743   9.286013        ...   3.596982   2.049241   7.062217   4.069786   3.094614
heartbeat_signals__matrix_profile__feature_"75"__threshold_0.98         11.484910  12.094899   8.246211   7.091706  10.270925        ...   7.033638   3.334109  10.081756   5.615282   5.916164

100000 rows × 787 columns
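
In the output, the query_similarity_count column changes from NaN to 0.0 while all other values stay the same. This is consistent with the rules described in the tsfresh documentation for impute: NaN is replaced by the column median, +inf by the column maximum, -inf by the column minimum, and a column with no finite value at all is filled with zeros. The following sketch re-implements those rules with pandas purely for illustration; impute_sketch and the toy frame are hypothetical, and unlike impute the sketch returns a copy instead of modifying its input.

```python
import numpy as np
import pandas as pd

def impute_sketch(df):
    """Illustrative re-implementation of the documented imputation rules (not tsfresh's code)."""
    out = df.astype(float).copy()
    for col in out.columns:
        values = out[col]
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0                      # no finite value at all: fill with zeros
            continue
        out[col] = (
            values.replace(np.inf, finite.max())      # +inf -> column maximum
                  .replace(-np.inf, finite.min())     # -inf -> column minimum
                  .fillna(finite.median())            # NaN  -> column median
        )
    return out

# Made-up toy frame covering the three cases.
toy = pd.DataFrame({
    "all_nan": [np.nan, np.nan, np.nan],
    "some_nan": [1.0, np.nan, 3.0],
    "with_inf": [np.inf, 2.0, -np.inf],
})
result = impute_sketch(toy)
print(result)
```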

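Features are then selected by their relevance to the response variable: the relevance of each feature to the response variable is first evaluated individually, and the Benjamini-Yekutieli procedure is then applied to decide which features to keep.

from tsfresh import select_features
# Select features by their relevance to the label
train_features_filtered = select_features(train_features,data_train_label)
train_features_filtered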
Output (the first and last 10 of the 707 columns, shown transposed: one row per feature, one column per sample id):

feature                                                              0          1          2          3          4        ...      99995      99996      99997      99998      99999
heartbeat_signals__sum_values                                38.927945  19.445634  21.192974  42.113066  69.756786        ...  63.323449  69.657534  40.897057  42.333303  53.290117
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_38      0.660949   1.718217   1.814281   2.109550   0.194549        ...   0.840651   1.557787   0.469758   0.992948   1.624625
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_37      1.090709   1.280923   1.619051   0.619634   0.348882        ...   1.186210   1.393960   1.000355   1.354894   1.739088
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_36      0.848728   1.850706   1.215343   2.366413   0.092119        ...   1.396236   0.989147   0.706395   2.238589   2.936555
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35      1.168685   1.460752   1.787166   2.071539   0.653924        ...   0.417221   1.611333   1.190514   1.237608   0.154759
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34      0.982133   1.924501   2.146987   1.000340   0.231422        ...   2.036034   1.793044   0.674603   1.325212   2.921164
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33      1.223496   1.925485   1.686190   2.728281   1.080003        ...   1.659054   1.092325   1.632769   2.785515   2.183932
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32      1.236300   1.715938   1.540137   1.391727   0.711244        ...   0.500584   0.507138   0.229008   1.918571   1.485150
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31      1.104172   2.079957   2.291031   2.017176   1.357904        ...   1.693545   1.763940   2.027802   0.814167   2.685922
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30      1.497129   1.818636   2.403422   2.610492   1.237998        ...   0.859932   2.677643   0.302457   2.613950   0.583443
...                                                                ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_84      0.531883   0.563590   0.712487   0.601499   0.015292        ...   0.779955   0.539489   0.282597   0.594252   0.463697
heartbeat_signals__fft_coefficient__attr_"imag"__coeff_97    -0.047438  -0.109579  -0.074042  -0.184248   0.070505        ...   0.005525   0.114670  -0.474629  -0.162106   0.289364
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_90      0.554370   0.697446   0.321703   0.564669   0.065835        ...   0.486013   0.579498   0.460647   0.694276   0.285321
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_94      0.307586   0.398073   0.390386   0.623353   0.051780        ...   0.273372   0.417226   0.478341   0.681025   0.422103
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_92      0.564596   0.640969   0.716929   0.466980   0.092940        ...   0.705386   0.270110   0.527891   0.357196   0.692009
heartbeat_signals__fft_coefficient__attr_"real"__coeff_97     0.562960   0.270192   0.316524   0.651774   0.103773        ...   0.602898   0.556596   0.904111   0.498088   0.276236
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_75      0.591859   0.224925   0.422077   0.308915   0.179405        ...   0.447929   0.703258   0.728529   0.433297   0.245780
heartbeat_signals__fft_coefficient__attr_"real"__coeff_88     0.504124   0.645082   0.722742   0.550097  -0.089611        ...   0.474844   0.462312   0.178410   0.406154   0.269519
heartbeat_signals__fft_coefficient__attr_"real"__coeff_92     0.528450   0.635135   0.680590   0.466904   0.091841        ...   0.564266   0.269719   0.500813   0.324771   0.681719
heartbeat_signals__fft_coefficient__attr_"real"__coeff_83     0.473568   0.297325   0.383754   0.494024   0.056867        ...   0.133969   0.539236   0.773985   0.340727  -0.053993

100000 rows × 707 columns
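
For readers unfamiliar with the Benjamini-Yekutieli procedure, the sketch below shows the textbook version of it on a handful of made-up p-values. It is not tsfresh's implementation: the function name benjamini_yekutieli and the example p-values are assumptions for illustration, and alpha plays the role of the false discovery rate level (select_features exposes a comparable fdr_level parameter). The idea is to sort the m p-values, compare the k-th smallest with k / (m * c(m)) * alpha, where c(m) = 1 + 1/2 + ... + 1/m, and keep every feature up to the largest k that passes.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                                  # indices of p-values, ascending
    c_m = np.sum(1.0 / np.arange(1, m + 1))                # harmonic correction factor c(m)
    thresholds = np.arange(1, m + 1) / (m * c_m) * alpha   # threshold for each rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])                   # largest rank that passes its threshold
        rejected[order[: k + 1]] = True                    # reject all hypotheses up to that rank
    return rejected

# Made-up p-values, one per candidate feature.
p_vals = [0.0001, 0.004, 0.03, 0.2, 0.9]
mask = benjamini_yekutieli(p_vals, alpha=0.05)
print(mask)
```

In feature selection terms, a rejected hypothesis corresponds to a feature judged relevant to the label, so the mask marks the features that would be kept.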
