• Title/Abstract/Keywords: t-SNE

42 search results, processing time 0.025 seconds

A review on the t-distributed stochastic neighbors embedding

  • 김기풍;김충락
    • The Korean Journal of Applied Statistics / Vol. 36, No. 2 / pp. 167-173 / 2023
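  • This paper introduces various methods for visualizing high-dimensional data by transforming it into a lower dimension. Dimension reduction methods can be broadly divided into linear and nonlinear methods. As linear methods, principal component analysis and multidimensional scaling are briefly introduced; as nonlinear methods, kernel principal component analysis, self-organizing maps, locally linear embedding, Isomap, and local multidimensional scaling are briefly introduced. t-SNE, which was proposed most recently and is very widely used but is still relatively unfamiliar in the field of statistics, is described in detail. We present a simple example using t-SNE, introduce recent research papers that point out its advantages and disadvantages, and review the future research topics they propose.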
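As a rough illustration of the quantity that gives t-SNE its name, the sketch below computes the low-dimensional similarities q_ij with the Student-t (one degree of freedom) kernel and the Kullback-Leibler cost KL(P || Q) that t-SNE minimizes. The formulas follow the standard t-SNE formulation rather than anything quoted in the abstract above; the function names and the toy embedding are assumptions made for this example.

```python
import math

def student_t_affinities(points):
    """Return the matrix Q of low-dimensional similarities q_ij."""
    n = len(points)
    # Unnormalized heavy-tailed kernel (1 + squared distance)^-1; diagonal stays 0.
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                kernel[i][j] = 1.0 / (1.0 + d2)
    # Normalize over all pairs so the entries of Q sum to 1.
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def kl_divergence(P, Q):
    """KL(P || Q), summed over pairs with p_ij > 0."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Hypothetical 2-D embedding of three points.
embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
Q = student_t_affinities(embedding)
```

Nearby points receive a larger share of the total similarity than distant ones, and the cost is zero only when the low-dimensional similarities match the high-dimensional ones exactly.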

Violation Pattern Analysis for Good Manufacturing Practice for Medicine using t-SNE Based on Association Rule and Text Mining

  • 이준오;손소영
    • Journal of Korean Society for Quality Management / Vol. 50, No. 4 / pp. 717-734 / 2022
  • Purpose: The purpose of this study is to effectively detect simultaneous violations of Good Manufacturing Practice that were concealed by drug manufacturers. Methods: We present a framework for analyzing regulatory violation patterns using Association Rule Mining (ARM), text mining, and t-distributed Stochastic Neighbor Embedding (t-SNE) to increase the effectiveness of on-site inspection. Results: A number of simultaneous violation patterns were discovered by applying Association Rule Mining to FDA inspection data collected from October 2008 to February 2022. Among them were 'concurrent violation patterns' derived from the similar regulatory ranges of two or more regulations. These patterns do not help to predict violations that appear simultaneously but belong to different regulations. Those unnecessary patterns were excluded by applying t-SNE based on text mining. Conclusion: Our proposed approach enables the recognition of simultaneous violation patterns during on-site inspection. It is expected to decrease detection time by increasing the likelihood of finding intentionally concealed violations.
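The abstract does not say which rule measures were used, so the following is only a minimal sketch of the usual association-rule quantities (support, confidence, lift) for a rule "violation A implies violation B". The function name and the inspection records are hypothetical and are not taken from the FDA data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n                         # share of records containing both sides
    confidence = n_both / n_a if n_a else 0.0    # estimate of P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Hypothetical inspections, each listing the regulations found violated.
inspections = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"C"},
]
support, confidence, lift = rule_metrics(inspections, {"A"}, {"B"})
```

A lift above 1 indicates that the two violations co-occur more often than they would if they were independent.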

Cluster Analysis of Daily Electricity Demand with t-SNE

  • Min, Yunhong
    • Journal of the Korea Society of Computer and Information / Vol. 23, No. 5 / pp. 9-14 / 2018
  • For efficient management of the electricity market and power systems, accurate forecasts of electricity demand are essential. Since many factors, both known and unknown, determine the realized loads, it is difficult to forecast demand from the past time series alone. In this paper we perform a cluster analysis on electricity demand data collected from Jan. 2000 to Dec. 2017. The motivation for clustering is that each cluster is expected to consist of data whose latent variables take the same or similar values. If the data are properly clustered, an accurate forecasting model can then be developed for each cluster separately. To validate the feasibility of this approach for building better forecasting models, we clustered the data with t-SNE. To apply t-SNE to time series data effectively, we adopt dynamic time warping as a similarity measure. From the experimental results, we found that several clusters are clearly observed and that each cluster can be interpreted as a mix of well-known factors such as trend, seasonality, and holiday effects, together with other unknown factors. These findings can motivate approaches that build a forecasting model for each cluster independently.
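To make the role of dynamic time warping concrete, here is a minimal sketch of the classic DTW recurrence with an absolute-difference local cost, plus a pairwise distance matrix over several series. The abstract does not specify the local cost or any window constraint, so both are assumptions here, and the load profiles are invented.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] also matched b[j-1]'s successor
                                    cost[i][j - 1],      # b[j-1] also matched a[i-1]'s successor
                                    cost[i - 1][j - 1])  # both series advance together
    return cost[n][m]

def pairwise_dtw(series):
    """Symmetric matrix of DTW distances between all pairs of series."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist

# Hypothetical daily load profiles of unequal length.
daily_loads = [[1, 2, 3, 2, 1], [1, 1, 2, 3, 2, 1], [3, 3, 3, 3, 3]]
D = pairwise_dtw(daily_loads)
```

A matrix like `D` can be handed to a t-SNE implementation that accepts precomputed distances; in scikit-learn, for instance, this corresponds to `TSNE(metric="precomputed", init="random")`.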

Extra-tidal stars around globular clusters NGC 5024 and NGC 5053 and their chemical abundances

  • Chun, Sang-Hyun;Lee, Jae-Joon
    • The Bulletin of The Korean Astronomical Society / Vol. 43, No. 2 / pp. 40.2-40.2 / 2018
  • NGC 5024 and NGC 5053 are among the most metal-poor globular clusters in the Milky Way. Both globular clusters are considered to have been accreted from dwarf galaxies (such as the Sagittarius dwarf galaxy or the Magellanic Clouds), and a common stellar envelope and tidal tails between the two clusters have also been detected. We present a search for extra-tidal cluster member candidates around these globular clusters using APOGEE survey data. Using 20 chemical elements (e.g., Fe, C, Mg, Al) and radial velocities, we explored t-distributed stochastic neighbour embedding (t-SNE), which identifies an optimal mapping of a high-dimensional space into fewer dimensions, and we find that globular cluster stars are well separated from field stars in the two-dimensional t-SNE map. We also find that some stars selected in the t-SNE map lie outside the tidal radius of the clusters. The proper motions of the stars outside the tidal radius are also comparable to those of the globular clusters, which suggests that these stars are tidally decoupled from the globular clusters. We manually measure chemical abundances for the cluster and extra-tidal stars, and discuss the association of the extra-tidal stars with the clusters.

Physiological Signal-Based Emotion Recognition in Conversations Using T-SNE

  • 임수빈;이병천;문지훈
    • Proceedings of the Korea Information Processing Society Conference / Korea Information Processing Society 2023 Spring Conference / pp. 703-705 / 2023
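  • This study proposes a more accurate and more general emotion recognition technique that uses physiological signal data recorded during conversations. To this end, we first equalize the number of measurements across conversations of different lengths and, in order to compare and analyze effective combinations of physiological signals, use the dimensionality reduction technique T-SNE (T-distributed Stochastic Neighbor Embedding) to inspect the distribution of emotion labels. In addition, we use AutoML (Automated Machine Learning) on the reduced data to classify emotions and to predict arousal and valence, thereby finding the combination of physiological signals that recognizes emotion best.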
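The abstract does not say how the number of measurements per conversation was equalized. One simple possibility, shown here purely as an assumption, is to resample every signal to a fixed length by linear interpolation; the function name and sample values are hypothetical.

```python
def resample(signal, target_len):
    """Resample a 1-D signal to target_len points by linear interpolation."""
    n = len(signal)
    if n == 0 or target_len <= 0:
        raise ValueError("signal must be non-empty and target_len positive")
    if n == 1 or target_len == 1:
        return [float(signal[0])] * target_len
    out = []
    for k in range(target_len):
        # Position of output sample k on the original index axis [0, n - 1].
        pos = k * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

# Two hypothetical recordings of different lengths mapped to 3 samples each.
short_signal = resample([0, 10], 3)
long_signal = resample([1, 2, 3, 4, 5], 3)
```

After this step every conversation yields a vector of the same length, which is what a method such as t-SNE needs as input.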

Phenolic Composition, Fermentation Profile, Protozoa Population and Methane Production from Sheanut (Butryospermum Parkii) Byproducts In vitro

  • Bhatta, Raghavendra;Mani, Saravanan;Baruah, Luna;Sampath, K.T.
    • Asian-Australasian Journal of Animal Sciences / Vol. 25, No. 10 / pp. 1389-1394 / 2012
  • Sheanut cake (SNC), expeller (SNE) and solvent extractions (SNSE) samples were evaluated to determine their suitability in animal feeding. The CP content was highest in SNSE (16.2%) followed by SNE (14.7%) and SNC (11.6%). However, metabolizable energy (ME, MJ/kg) was maximum in SNC (8.2) followed by SNE (7.9) and SNSE (7.0). The tannin phenol content was about 7.0 per cent and mostly in the form of hydrolyzable tannin (HT), whereas condensed tannin (CT) was less than one per cent. The in vitro gas production profiles indicated similar $y_{max}$ (maximum potential of gas production) among the 3 by-products. However, the rate of degradation (k) was maximum in SNC followed by SNE and SNSE. The $t^{1/2}$ (time taken for reaching half asymptote) was lowest in SNC (14.4 h) followed by SNE (18.7 h) and SNSE (21.9 h). The increment in the in vitro gas volume (ml/200 mg DM) with PEG (polyethylene glycol)-6000 (as a tannin binder) addition was 12.0 in SNC, 9.6 in SNE and 11.0 in SNSE, respectively. The highest ratio of $CH_4$ (ml) reduction per ml of the total gas, an indicator of the potential of tannin, was recorded in SNE (0.482) followed by SNC (0.301) and SNSE (0.261). There was a significant (p<0.05) reduction in entodinia population and total protozoa population. Differential protozoa counts revealed that Entodinia populations increased to a greater extent than Holotricha when PEG was added. This is the first report on the antimethanogenic property of sheanut byproducts. It could be concluded that all three forms of SN byproducts are a medium source of protein and energy for ruminants. There is great potential for SN by-products to be incorporated in ruminant feeding not only as a source of energy and protein, but also to protect the protein from rumen degradation and suppress enteric methanogenesis.

Research Trends of Ergonomics in Occupational Safety and Health through MEDLINE Search: Focus on Abstract Word Modeling using Word Embedding

  • 김준희;황의재;안선희;곽경태;정성훈
    • Journal of the Korean Society of Safety / Vol. 36, No. 5 / pp. 61-70 / 2021
  • This study aimed to analyze research trends in the abstracts of ergonomic studies registered in MEDLINE, a medical bibliographic database, using word embedding. Medical-related ergonomic studies mainly focus on work-related musculoskeletal disorders, and no studies have analyzed words as data using natural language processing techniques such as word embedding. In this study, the abstract data of ergonomic studies were extracted with a program written in Python using the selenium and BeautifulSoup modules. Word embedding of the abstract data was performed using the word2vec model, after which the words found in the abstracts were vectorized. The vectorized data were visualized in two dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE). The word "ergonomics" and ten of the most frequently used words in the abstracts were selected as keywords. The results revealed that the most frequently used words in the abstracts of ergonomics studies include "use", "work", and "task". In addition, the t-SNE technique revealed that words such as "workplace", "design", and "engineering" exhibited the highest relevance to ergonomics. The keywords observed in the abstracts of ergonomic studies using t-SNE were classified into four groups. Ergonomics studies registered in MEDLINE have investigated the risk factors associated with workers performing an operation or task using tools, and in this study, ergonomics studies were characterized by the relationships between keywords using word embedding. The results of this study will provide useful and diverse insights into future research directions for ergonomic studies.
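The keyword-selection step described above amounts to counting word frequencies across abstracts. The following is a minimal sketch of that step only, assuming simple lowercase tokenization and no stop-word removal; the sample abstracts are invented and the function name is hypothetical.

```python
import re
from collections import Counter

def top_words(abstracts, k):
    """Return the k most frequent lowercase word tokens across all abstracts."""
    counts = Counter()
    for text in abstracts:
        # Keep runs of ASCII letters only; a real pipeline would likely also drop stop words.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(k)

# Hypothetical abstracts standing in for the MEDLINE records.
abstracts = [
    "Workers use tools at work",
    "Task design and work",
    "Use of work tools",
]
top = top_words(abstracts, 3)
```

The resulting keywords would then be looked up in the trained embedding model before projecting their vectors with t-SNE.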

Decision support system for underground coal pillar stability using unsupervised and supervised machine learning approaches

  • Kamran, Muhammad;Shahani, Niaz Muhammad;Armaghani, Danial Jahed
    • Geomechanics and Engineering / Vol. 30, No. 2 / pp. 107-121 / 2022
  • Coal pillar assessment is of broad importance to underground engineering structures, as pillar failure can lead to enormous disasters. Because of the highly non-linear correlation between pillar failure and its influential attributes, conventional forecasting techniques cannot generate accurate outcomes. To approximate the complex behavior of coal pillars, this paper presents a new approach to forecasting underground coal pillar stability using combined unsupervised-supervised learning. To build the database for the study, a total of 90 pillar cases were collected from actual engineering structures. A state-of-the-art dimensionality reduction method, t-distributed stochastic neighbor embedding (t-SNE), was employed to reduce the dimensionality of the original data features. Next, an unsupervised machine learning technique, K-means clustering, was applied to the t-SNE dimensionality-reduced data in order to compute the relative class of the coal pillar cases. Following that, the reassigned dataset was divided into two parts: 70 percent for training and 30 percent for testing. The accuracy of the predictions was then examined for a support vector classifier (SVC) model using performance measures such as precision, recall, and f1-score. As a result, the proposed model can be employed for properly predicting the pillar failure class in a variety of underground rock engineering projects.
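For reference, the performance measures named above can be computed from true and predicted labels as in the sketch below. It treats one class as positive (a binary simplification), and the labels are invented for illustration rather than taken from the pillar dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical test-set labels: 1 = failed pillar, 0 = stable pillar.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With more than two classes, the same computation is typically repeated per class and then averaged.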

A study on intrusion detection performance improvement through imbalanced data processing

  • 정일옥;지재원;이규환;김묘정
    • Convergence Security Journal / Vol. 21, No. 3 / pp. 57-66 / 2021
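  • As the detection performance of deep learning and machine learning has been validated in the field of intrusion detection, the number of cases applying them is increasing daily. However, collecting the data needed for training is difficult, and imbalance in the collected data makes it hard to achieve the same machine learning performance in practice. As a solution, this paper proposes a hybrid sampling technique that uses t-SNE visualization to handle imbalanced data. To this end, intrusion detection events, including their payloads, are first split into fields according to their characteristics. TF-IDF-based features are then extracted from the separated fields. After the hybrid sampling technique is applied to the extracted features, data visualization with t-SNE yields a dataset optimized for intrusion detection in which the imbalance has been handled. Nine sampling techniques were applied to the public intrusion detection dataset CSIC2012, and the F-score and G-mean evaluation metrics verified that the proposed sampling technique improves detection performance.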
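G-mean is commonly defined as the geometric mean of sensitivity and specificity, which is why it is favored for imbalanced data. The sketch below assumes that binary definition (the abstract does not spell it out) and uses invented labels to show how it differs from plain accuracy.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical imbalanced test set: 4 attacks (1) among 16 events.
y_true = [1, 1, 1, 1] + [0] * 12
y_pred = [1, 0, 0, 0] + [0] * 12
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = g_mean(y_true, y_pred)
```

Here accuracy looks respectable because most events are benign, while G-mean drops sharply because three of the four attacks were missed.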

Research Trends in Record Management Using Unstructured Text Data Analysis

  • 홍덕용;허준석
    • Journal of Korean Society of Archives and Records Management / Vol. 23, No. 4 / pp. 73-89 / 2023
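  • The purpose of this study is to identify trends in Korean records management research by applying text mining to Korean-language abstracts, the unstructured text data of the field, analyzing keyword frequencies and the distances between keywords. To this end, from the journals (28 titles) retrieved under the major category (interdisciplinary studies) and middle category (library and information science) in the journal institution statistics (registered and candidate journals) of the Korea Citation Index (KCI), 1,157 papers from 7 registered journals were extracted and 77,578 keywords were visualized. Analyses including t-SNE with Word2vec and Scattertext were performed. First, a frequency analysis of the 77,578 keywords obtained from the 1,157 papers confirmed that keywords such as "records management" (889 occurrences), "analysis" (888), "archive" (742), "records" (562), and "utilization" (449) are treated as major topics by researchers. Second, vector representations of the keywords were generated through Word2vec analysis and the similarity distances between them were examined, after which the results were visualized using t-SNE and Scattertext. In the visualization, the records management research field divided into two groups: in the first group (past), keywords such as "archiving", "national records management", "standardization", "official documents", and "records management system" appeared with high frequency, while in the second group (present), keywords such as "community", "data", "records information service", "online", and "digital archive" were found to be receiving major attention.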
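The abstract does not state which similarity measure was applied to the Word2vec vectors; cosine similarity is the usual choice for word embeddings, so the sketch below assumes it. The two-dimensional vectors are invented stand-ins for real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical keyword vectors (real Word2vec vectors have many more dimensions).
vectors = {
    "archive": [1.0, 0.0],
    "digital archive": [1.0, 1.0],
    "community": [0.0, 2.0],
}
near = cosine_similarity(vectors["archive"], vectors["digital archive"])
far = cosine_similarity(vectors["archive"], vectors["community"])
```

Keywords whose vectors point in similar directions score close to 1 and tend to land near each other in the t-SNE plot.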