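Working with PDF files in R
pdftools
pdftools is an R package dedicated to working with PDF files.
pdf_text()
pdf_text() returns the text of a PDF as a character vector, with one element per page.
> # For example
> library(pdftools)
> a <- pdf_text("41375_2012_BFleu2012127_MOESM29_ESM.pdf")
> # Check the number of pages
> length(a)
[1] 23
> # Look at the first page; formulas do not seem to be extracted properly
> a[1]
From here, extract whatever information you need from the PDF.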
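Before extracting anything, it often helps to find out which page holds the content you are after. The sketch below is a minimal illustration in base R, assuming a made-up `pages` vector that stands in for the output of pdf_text(); the helper name `find_pages` is hypothetical and not part of pdftools.

```r
# Made-up stand-in for the character vector returned by pdf_text():
# one string per page (contents shortened for illustration).
pages <- c(
  "SUPPLEMENTAL METHODS\n\nPatients in training dataset\n",
  "Survival signature\nThe model was built using a framework\n",
  "SUPPLEMENTAL TABLES\n\nTable S1. EMC-92 gene signature.\n"
)

# Hypothetical helper: indices of the pages whose text contains `pattern`.
# fixed = TRUE treats the pattern as a literal string, not a regex.
find_pages <- function(pages, pattern) {
  which(grepl(pattern, pages, fixed = TRUE))
}

hits <- find_pages(pages, "Table S1")
print(hits)
```

With a real file, `pages` would simply be the result of pdf_text() on that file.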
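> # Another example: I want to extract the gene symbols on page 10
> b <- pdf_text("41375_2012_BFleu2012127_MOESM29_ESM.pdf")
> # Check the number of pages
> length(b)
[1] 23
> # Look at the first page; formulas do not seem to be extracted properly
> b[1]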
[1] " SUPPLEMENTAL METHODS\n\nPatients in training dataset\nThe HOVON-65/GMMG-HD4 randomized clinical trial (ISRCTN64455289) consists of newly diagnosed,\ntransplant-eligible patients with multiple myeloma. Patients were randomly assigned to either bortezomib based\ntreatment or vincristine based treatment. Vincristine based treatment: three cycles of induction treatment with\nvincristine 0.4 mg intravenously on days 1-4, doxorubicin 9 mg/m² intravenously on days 1-4, and dexamethasone\n40 mg orally on days 1-4, 9-12, and 17-20; bortezomib based treatment: bortezomib 1.3 mg/m² intravenously on\ndays 1, 4, 8, and 11, doxorubicin 9 mg/m² intravenously on days 1-4, and dexamethasone 40 mg orally on days 1-4,\n9-12, and 17-20. Stem-cells were mobilized by use of cyclophosphamide 1000 mg/m² intravenously on day 1,\ndoxorubicin 15 mg/m² intravenously on days 1-4, dexamethasone 40 mg orally on days 1-4, and granulocyte colony-\nstimulating factor (filgrastim) 10 μg/kg per day subcutaneously, divided in two doses per day, from day 5 until last\nstem cell collection. A minimum of 2.5 × 106 CD34+ cells per transplantation procedure was required. After\ninduction therapy, patients received one (HOVON-65) or two (GMMG-HD4) cycles of high-dose melphalan (200\nmg/m² intravenously) with autologous stem-cell rescue followed by maintenance treatment with thalidomide (50 mg\nper day orally; group assigned to vincristine-based induction treatment) or bortezomib (1.3 mg/m² intravenously\nonce every 2 weeks; group assigned to bortezomib-based induction treatment) for 2 years. Treatment was not\nmasked for physicians and patients (see Figure S1).\nInformed consent to treatment protocols and sample procurement was obtained for all cases included in this study, in\naccordance with the Declaration of Helsinki. 
Use of diagnostic tumour material was approved by the institutional\nreview board of the Erasmus Medical Centre.\n\nPatients in validation datasets\nUAMS-TT2 is a randomized trial in which patients received thalidomide during all treatment phases (UAMS-TT2;\nn=351; GSE2658; NCT00573391).1 UAMS-TT3 is a similar regimen with the addition of bortezomib to the\nthalidomide arm (UAMS-TT3; n=142; E-TABM-1138; NCT00081939).2 The MRC-IX trial (n=247; GSE15695;\nISRCTN68454111) included both transplant-eligible and non-transplant-eligible newly diagnosed patients. For\ntransplant-eligible patients treatment consisted of induction high-dose therapy while non-transplant-eligible patients\nwere treated initially with either thalidomide or melphalan. Maintenance for both age classes was a comparison of\nthalidomide vs. no thalidomide.3, 4 The trial and dataset denoted here as APEX consisted of the three trials APEX,\nSUMMIT and CREST (n=264; GSE9782; registered under M34100-024, M34100-025 and\nNCT00049478/NCT00048230).5-8 The APEX trial included patients with relapsed myeloma who received either\nbortezomib or high-dose dexamethasone, with the possibility to cross-over to receive bortezomib after disease\nprogression.5 In the SUMMIT trial patients received bortezomib. In patients with a suboptimal response, oral\ndexamethasone was added to the regimen.8 The CREST trial included relapsed or refractory patients who received\nbortezomib. Dexamethasone was permitted in patients with progressive or stable disease. 
7\n\nSurvival signature\nThe MAS5 normalized, log2 transformed and mean-variance scaled HOVON-65/GMMG-HD4 dataset was used as a\ntraining set for building a GEP based survival classifier.9, 10 The model was built using a Supervised Principal\nComponent Analysis (SPCA) framework.11 This technique is widely used in biological settings.12-19 The underlying\nassumption is the existence of a high-risk group which can be separated from a standard-risk group on the basis of\nprogression free survival. A Principal Component Analysis (PCA) is a rotation of a n m centered feature space\nX in such a way that the largest variance in the data is projected on the top principal components. 20 This rotation\ncan be described by a m m rotation matrix R pca\n\n X rot XR pca\n\nX rot is rotated in such a way that the first principal component (PC) is the axis that points in the direction\nexhibiting the largest variance. Every subsequent PC is perpendicular to all previous while capturing as much as\npossible of the remaining variance. SPCA is a PCA whereby the feature space has undergone a selection X sel . In\nthis study the initial selection is based on selecting the top probe sets that were ranked by a univariate Cox\nproportional hazard regression. This will result in high variance due to survival so it is likely survival is projected\nonto the top PC's on which a Cox proportional hazard regression is applied. This yields regression coefficients β.\nThe resulting model can be summarized as:\n\n\n 1\n"
> # Extract page 10
> b[10] # a jumbled mess!
[1] " SUPPLEMENTAL TABLES\n\nTable S1. EMC-92 gene signature. Probe sets are ordered by decreasing magnitude of weighting coefficient\n(beta)\n Weighting Symbol\nRank Probes GO-term/description1\n coefficient (beta)\n 1 202728_s_at -0.1105 LTBP1 negative regulation of TGFbeta receptor signaling\n 2 239054_at -0.1088 SFMBT1 regulation of transcription\n 3 208942_s_at -0.0997 SEC62 cotranslational protein targeting to membrane\n 4 208747_s_at -0.0874 C1S proteolysis\n 5 202542_s_at 0.0870 AIMP1 negative regulation of endothelial cell proliferation\n 6 214482_at 0.0861 ZBTB25 transcription\n 7 228416_at -0.0778 ACVR2A transmembrane receptor protein serine/threonine kinase signaling\n 8 217728_at 0.0773 S100A6 signal transduction\n 9 215177_s_at -0.0768 ITGA6 cell-substrate junction assembly\n 10 225601_at 0.0750 HMGB3 multicellular organismal development\n 11 207618_s_at 0.0746 BCS1L mitochondrion organization\n 12 231989_s_at 0.0730 LOC100271836 ---\n 13 202884_s_at 0.0714 PPP2R1B control of cell growth and division\n 14 231738_at 0.0686 PCDHB7 calcium-dependent cell-cell adhesion\n 15 238116_at 0.0661 DYNLRB2 microtubule-based movement\n 16 226218_at -0.0644 IL7R regulation of DNA recombination\n 17 202842_s_at -0.0626 DNAJB9 protein folding\n 18 208732_at -0.0618 RAB2A ER to Golgi vesicle-mediated transport\n 19 204379_s_at 0.0594 FGFR3 MAPKKK cascade\n 20 242180_at -0.0585 TSPAN16 cellular activation and adhesion\n 21 216473_x_at -0.0576 DUX4 regulation of transcription, DNA-dependent\n 22 209683_at -0.0561 FAM49A ---\n 23 219550_at 0.0559 ROBO3 axon guidance\n 24 223811_s_at 0.0556 SUN1 / GET4 cytoskeletal anchoring at nuclear membrane\n 25 202813_at 0.0548 TARBP1 regulation of transcription from RNA polymerase II promoter\n 26 212282_at 0.0530 TMEM97 cholesterol homeostasis\n 27 238780_s_at -0.0529 EST/ BX647543 ---\n 28 M97935_MA_at2 0.0525 STAT1 transcription from RNA polymerase II promoter\n 29 221041_s_at -0.0520 SLC17A5 anion transport\n 30 224009_x_at 
-0.0520 DHRS9 androgen metabolic process\n 31 214612_x_at 0.0496 MAGEA6 ---\n 32 208232_x_at -0.0493 --- ---\n 33 238662_at 0.0490 ATPBD4 ---\n 34 206204_at 0.0477 GRB14 signal transduction\n 35 233437_at 0.0446 GABRA4 transport\n 36 200875_s_at 0.0437 NOP56 rRNA processing\n 37 38158_at 0.0423 ESPL1 apoptosis\n 38 217548_at -0.0423 C15orf38 ---\n 39 220351_at 0.0420 CCRL1 chemotaxis\n 40 213002_at -0.0418 MARCKS actin filament crosslinking\n 41 243018_at 0.0407 EST/BE568408 ---\n 42 221755_at 0.0396 EHBP1L1 ---\n 43 208667_s_at -0.0390 ST13 protein folding\n 44 212055_at 0.0384 C18orf10 cytoskeleton\n 45 201292_at -0.0372 TOP2A DNA ligation\n 46 201102_s_at 0.0349 PFKL fructose 6-phosphate metabolic process\n 47 214150_x_at -0.0349 ATP6V0E1 proton transport\n 48 226742_at -0.0345 SAR1B transport\n 49 215181_at -0.0342 CDH22 cell adhesion\n 50 208904_s_at -0.0334 RPS28 rRNA processing\n\n 10\n"
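The page looks jumbled mostly because printing a string shows the escaped "\n" instead of real line breaks. A small sketch in base R, using a shortened, made-up `page` string in place of b[10], shows how cat() makes the layout visible and how strsplit() yields the same lines as a vector.

```r
# Shortened, made-up stand-in for one element of pdf_text() output.
page <- paste0(
  "  SUPPLEMENTAL TABLES\n\n",
  "Rank  Probes        Symbol\n",
  "   1  202728_s_at   LTBP1\n",
  "   2  239054_at     SFMBT1\n"
)

# print() shows the escaped "\n"; cat() interprets it, so the page
# appears with its original line breaks.
cat(page)

# The same lines as a character vector, ready for further processing.
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
```

cat() is only for viewing; the splitting step is what later processing builds on.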
> # Split on the newline separator "\n"
> library(stringr) # provides str_split() and re-exports %>%
> b[10] %>% str_split("\n")
[[1]]
 [1] " SUPPLEMENTAL TABLES"
 [2] ""
 [3] "Table S1. EMC-92 gene signature. Probe sets are ordered by decreasing magnitude of weighting coefficient"
 [4] "(beta)"
 [5] " Weighting Symbol"
 [6] "Rank Probes GO-term/description1"
 [7] " coefficient (beta)"
 [8] " 1 202728_s_at -0.1105 LTBP1 negative regulation of TGFbeta receptor signaling"
 [9] " 2 239054_at -0.1088 SFMBT1 regulation of transcription"
[10] " 3 208942_s_at -0.0997 SEC62 cotranslational protein targeting to membrane"
[11] " 4 208747_s_at -0.0874 C1S proteolysis"
[12] " 5 202542_s_at 0.0870 AIMP1 negative regulation of endothelial cell proliferation"
[13] " 6 214482_at 0.0861 ZBTB25 transcription"
[14] " 7 228416_at -0.0778 ACVR2A transmembrane receptor protein serine/threonine kinase signaling"
[15] " 8 217728_at 0.0773 S100A6 signal transduction"
[16] " 9 215177_s_at -0.0768 ITGA6 cell-substrate junction assembly"
[17] " 10 225601_at 0.0750 HMGB3 multicellular organismal development"
[18] " 11 207618_s_at 0.0746 BCS1L mitochondrion organization"
[19] " 12 231989_s_at 0.0730 LOC100271836 ---"
[20] " 13 202884_s_at 0.0714 PPP2R1B control of cell growth and division"
[21] " 14 231738_at 0.0686 PCDHB7 calcium-dependent cell-cell adhesion"
[22] " 15 238116_at 0.0661 DYNLRB2 microtubule-based movement"
[23] " 16 226218_at -0.0644 IL7R regulation of DNA recombination"
[24] " 17 202842_s_at -0.0626 DNAJB9 protein folding"
[25] " 18 208732_at -0.0618 RAB2A ER to Golgi vesicle-mediated transport"
[26] " 19 204379_s_at 0.0594 FGFR3 MAPKKK cascade"
[27] " 20 242180_at -0.0585 TSPAN16 cellular activation and adhesion"
[28] " 21 216473_x_at -0.0576 DUX4 regulation of transcription, DNA-dependent"
[29] " 22 209683_at -0.0561 FAM49A ---"
[30] " 23 219550_at 0.0559 ROBO3 axon guidance"
[31] " 24 223811_s_at 0.0556 SUN1 / GET4 cytoskeletal anchoring at nuclear membrane"
[32] " 25 202813_at 0.0548 TARBP1 regulation of transcription from RNA polymerase II promoter"
[33] " 26 212282_at 0.0530 TMEM97 cholesterol homeostasis"
[34] " 27 238780_s_at -0.0529 EST/ BX647543 ---"
[35] " 28 M97935_MA_at2 0.0525 STAT1 transcription from RNA polymerase II promoter"
[36] " 29 221041_s_at -0.0520 SLC17A5 anion transport"
[37] " 30 224009_x_at -0.0520 DHRS9 androgen metabolic process"
[38] " 31 214612_x_at 0.0496 MAGEA6 ---"
[39] " 32 208232_x_at -0.0493 --- ---"
[40] " 33 238662_at 0.0490 ATPBD4 ---"
[41] " 34 206204_at 0.0477 GRB14 signal transduction"
[42] " 35 233437_at 0.0446 GABRA4 transport"
[43] " 36 200875_s_at 0.0437 NOP56 rRNA processing"
[44] " 37 38158_at 0.0423 ESPL1 apoptosis"
[45] " 38 217548_at -0.0423 C15orf38 ---"
[46] " 39 220351_at 0.0420 CCRL1 chemotaxis"
[47] " 40 213002_at -0.0418 MARCKS actin filament crosslinking"
[48] " 41 243018_at 0.0407 EST/BE568408 ---"
[49] " 42 221755_at 0.0396 EHBP1L1 ---"
[50] " 43 208667_s_at -0.0390 ST13 protein folding"
[51] " 44 212055_at 0.0384 C18orf10 cytoskeleton"
[52] " 45 201292_at -0.0372 TOP2A DNA ligation"
[53] " 46 201102_s_at 0.0349 PFKL fructose 6-phosphate metabolic process"
[54] " 47 214150_x_at -0.0349 ATP6V0E1 proton transport"
[55] " 48 226742_at -0.0345 SAR1B transport"
[56] " 49 215181_at -0.0342 CDH22 cell adhesion"
[57] " 50 208904_s_at -0.0334 RPS28 rRNA processing"
[58] ""
[59] " 10"
[60] ""
> # Tidier now; next drop the 7 header lines and the 3 trailing lines (blank, page number, blank)
> b[10] %>% str_split("\n") %>% .[[1]] %>% .[-c(1:7)] %>% .[-c(51:53)]
 [1] " 1 202728_s_at -0.1105 LTBP1 negative regulation of TGFbeta receptor signaling"
 [2] " 2 239054_at -0.1088 SFMBT1 regulation of transcription"
 [3] " 3 208942_s_at -0.0997 SEC62 cotranslational protein targeting to membrane"
 [4] " 4 208747_s_at -0.0874 C1S proteolysis"
 [5] " 5 202542_s_at 0.0870 AIMP1 negative regulation of endothelial cell proliferation"
 [6] " 6 214482_at 0.0861 ZBTB25 transcription"
 [7] " 7 228416_at -0.0778 ACVR2A transmembrane receptor protein serine/threonine kinase signaling"
 [8] " 8 217728_at 0.0773 S100A6 signal transduction"
 [9] " 9 215177_s_at -0.0768 ITGA6 cell-substrate junction assembly"
[10] " 10 225601_at 0.0750 HMGB3 multicellular organismal development"
[11] " 11 207618_s_at 0.0746 BCS1L mitochondrion organization"
[12] " 12 231989_s_at 0.0730 LOC100271836 ---"
[13] " 13 202884_s_at 0.0714 PPP2R1B control of cell growth and division"
[14] " 14 231738_at 0.0686 PCDHB7 calcium-dependent cell-cell adhesion"
[15] " 15 238116_at 0.0661 DYNLRB2 microtubule-based movement"
[16] " 16 226218_at -0.0644 IL7R regulation of DNA recombination"
[17] " 17 202842_s_at -0.0626 DNAJB9 protein folding"
[18] " 18 208732_at -0.0618 RAB2A ER to Golgi vesicle-mediated transport"
[19] " 19 204379_s_at 0.0594 FGFR3 MAPKKK cascade"
[20] " 20 242180_at -0.0585 TSPAN16 cellular activation and adhesion"
[21] " 21 216473_x_at -0.0576 DUX4 regulation of transcription, DNA-dependent"
[22] " 22 209683_at -0.0561 FAM49A ---"
[23] " 23 219550_at 0.0559 ROBO3 axon guidance"
[24] " 24 223811_s_at 0.0556 SUN1 / GET4 cytoskeletal anchoring at nuclear membrane"
[25] " 25 202813_at 0.0548 TARBP1 regulation of transcription from RNA polymerase II promoter"
[26] " 26 212282_at 0.0530 TMEM97 cholesterol homeostasis"
[27] " 27 238780_s_at -0.0529 EST/ BX647543 ---"
[28] " 28 M97935_MA_at2 0.0525 STAT1 transcription from RNA polymerase II promoter"
[29] " 29 221041_s_at -0.0520 SLC17A5 anion transport"
[30] " 30 224009_x_at -0.0520 DHRS9 androgen metabolic process"
[31] " 31 214612_x_at 0.0496 MAGEA6 ---"
[32] " 32 208232_x_at -0.0493 --- ---"
[33] " 33 238662_at 0.0490 ATPBD4 ---"
[34] " 34 206204_at 0.0477 GRB14 signal transduction"
[35] " 35 233437_at 0.0446 GABRA4 transport"
[36] " 36 200875_s_at 0.0437 NOP56 rRNA processing"
[37] " 37 38158_at 0.0423 ESPL1 apoptosis"
[38] " 38 217548_at -0.0423 C15orf38 ---"
[39] " 39 220351_at 0.0420 CCRL1 chemotaxis"
[40] " 40 213002_at -0.0418 MARCKS actin filament crosslinking"
[41] " 41 243018_at 0.0407 EST/BE568408 ---"
[42] " 42 221755_at 0.0396 EHBP1L1 ---"
[43] " 43 208667_s_at -0.0390 ST13 protein folding"
[44] " 44 212055_at 0.0384 C18orf10 cytoskeleton"
[45] " 45 201292_at -0.0372 TOP2A DNA ligation"
[46] " 46 201102_s_at 0.0349 PFKL fructose 6-phosphate metabolic process"
[47] " 47 214150_x_at -0.0349 ATP6V0E1 proton transport"
[48] " 48 226742_at -0.0345 SAR1B transport"
[49] " 49 215181_at -0.0342 CDH22 cell adhesion"
[50] " 50 208904_s_at -0.0334 RPS28 rRNA processing"
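Dropping lines by hard-coded positions works for this page, but the indices change as soon as a page has a different number of header lines. One possible alternative, sketched here in base R, is to keep only the lines that look like table rows. The sample `page_lines` vector and its spacing are made up to mimic the page-10 layout, and the regular expression is an assumption about that layout (rank number, probe ID containing "_at", signed decimal).

```r
# Made-up lines mimicking the split page: headers, three data rows,
# and the trailing blank / page-number lines.
page_lines <- c(
  "  SUPPLEMENTAL TABLES",
  "",
  "Table S1. EMC-92 gene signature.",
  "Rank   Probes          Symbol",
  "   1   202728_s_at     -0.1105   LTBP1    negative regulation of TGFbeta receptor signaling",
  "  28   M97935_MA_at2    0.0525   STAT1    transcription from RNA polymerase II promoter",
  "  37   38158_at         0.0423   ESPL1    apoptosis",
  "",
  "  10",
  ""
)

# A data row: optional spaces, rank, spaces, a probe ID containing "_at",
# spaces, then a signed decimal coefficient.
row_pattern <- "^\\s*[0-9]+\\s+\\S+_at\\S*\\s+-?[0-9]+\\.[0-9]+"
is_row <- grepl(row_pattern, page_lines, perl = TRUE)

table_rows <- page_lines[is_row]
print(table_rows)
```

The page number line ("  10") is skipped because nothing follows the number, so it never matches the probe ID part of the pattern.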
> # Looks more regular now; next split each line on spaces
> b[10] %>% str_split("\n") %>% .[[1]] %>% .[-c(1:7)] %>% .[-c(51:53)] %>% str_split(" ")
[[1]]
 [1] "" "" "1" "" "" ""
 [7] "202728_s_at" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "-0.1105" "" "" "" ""
[25] "" "" "" "" "" "LTBP1"
[31] "" "" "" "" "" ""
[37] "" "" "" "" "negative" "regulation"
[43] "of" "TGFbeta" "receptor" "signaling"
[[2]]
 [1] "" "" "2" "" ""
 [6] "" "239054_at" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "-0.1088" "" "" ""
[26] "" "" "" "" ""
[31] "" "SFMBT1" "" "" ""
[36] "" "" "" "" ""
[41] "" "regulation" "of" "transcription"[[3]][1] "" "" "3" "" [5] "" "" "208942_s_at" "" [9] "" "" "" ""
[13] "" "" "" ""
[17] "" "" "" "-0.0997"
[21] "" "" "" ""
[25] "" "" "" ""
[29] "" "SEC62" "" ""
[33] "" "" "" ""
[37] "" "" "" ""
[41] "cotranslational" "protein" "targeting" "to"
[45] "membrane" [[4]][1] "" "" "4" "" "" "" [7] "208747_s_at" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "-0.0874" "" "" "" ""
[25] "" "" "" "" "" "C1S"
[31] "" "" "" "" "" ""
[37] "" "" "" "" "" ""
[43] "proteolysis"[[5]][1] "" "" "5" "" "" [6] "" "202542_s_at" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "0.0870" "" "" "" ""
[26] "" "" "" "" ""
[31] "AIMP1" "" "" "" ""
[36] "" "" "" "" ""
[41] "" "negative" "regulation" "of" "endothelial"
[46] "cell" "proliferation"[[6]][1] "" "" "6" "" "" [6] "" "214482_at" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "" "0.0861" "" ""
[26] "" "" "" "" ""
[31] "" "" "ZBTB25" "" ""
[36] "" "" "" "" ""
[41] "" "" "transcription"[[7]][1] "" "" "7" "" [5] "" "" "228416_at" "" [9] "" "" "" ""
[13] "" "" "" ""
[17] "" "" "" ""
[21] "" "-0.0778" "" ""
[25] "" "" "" ""
[29] "" "" "" "ACVR2A"
[33] "" "" "" ""
[37] "" "" "" ""
[41] "" "transmembrane" "receptor" "protein"
[45] "serine/threonine" "kinase" "signaling" [[8]][1] "" "" "8" "" "" [6] "" "217728_at" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "" "0.0773" "" ""
[26] "" "" "" "" ""
[31] "" "" "S100A6" "" ""
[36] "" "" "" "" ""
[41] "" "" "signal" "transduction"[[9]][1] "" "" "9" "" "" [6] "" "215177_s_at" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" "-0.0768"
[21] "" "" "" "" ""
[26] "" "" "" "" "ITGA6"
[31] "" "" "" "" ""
[36] "" "" "" "" ""
[41] "cell-substrate" "junction" "assembly" [[10]][1] "" "" "10" "" "" [6] "225601_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "0.0750" "" "" ""
[26] "" "" "" "" ""
[31] "" "HMGB3" "" "" ""
[36] "" "" "" "" ""
[41] "" "" "multicellular" "organismal" "development" [[11]][1] "" "" "11" "" "" [6] "207618_s_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" "0.0746"
[21] "" "" "" "" ""
[26] "" "" "" "" "BCS1L"
[31] "" "" "" "" ""
[36] "" "" "" "" ""
[41] "mitochondrion" "organization" [[12]][1] "" "" "12" "" "" [6] "231989_s_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" "0.0730"
[21] "" "" "" "" ""
[26] "" "" "" "" "LOC100271836"
[31] "" "" "" "---" [[13]][1] "" "" "13" "" "" "202884_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "0.0714" "" "" "" ""
[25] "" "" "" "" "" "PPP2R1B"
[31] "" "" "" "" "" ""
[37] "" "" "control" "of" "cell" "growth"
[43] "and" "division" [[14]][1] "" "" "14" "" [5] "" "231738_at" "" "" [9] "" "" "" ""
[13] "" "" "" ""
[17] "" "" "" ""
[21] "" "0.0686" "" ""
[25] "" "" "" ""
[29] "" "" "" "PCDHB7"
[33] "" "" "" ""
[37] "" "" "" ""
[41] "" "calcium-dependent" "cell-cell" "adhesion" [[15]][1] "" "" "15" "" [5] "" "238116_at" "" "" [9] "" "" "" ""
[13] "" "" "" ""
[17] "" "" "" ""
[21] "" "0.0661" "" ""
[25] "" "" "" ""
[29] "" "" "" "DYNLRB2"
[33] "" "" "" ""
[37] "" "" "" ""
[41] "microtubule-based" "movement" [[16]][1] "" "" "16" "" "" [6] "226218_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "-0.0644" "" "" "" ""
[26] "" "" "" "" ""
[31] "IL7R" "" "" "" ""
[36] "" "" "" "" ""
[41] "" "" "regulation" "of" "DNA"
[46] "recombination"[[17]][1] "" "" "17" "" "" "202842_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0626" "" "" "" "" ""
[25] "" "" "" "" "DNAJB9" ""
[31] "" "" "" "" "" ""
[37] "" "" "protein" "folding" [[18]][1] "" "" "18" "" [5] "" "208732_at" "" "" [9] "" "" "" ""
[13] "" "" "" ""
[17] "" "" "" ""
[21] "-0.0618" "" "" ""
[25] "" "" "" ""
[29] "" "" "RAB2A" ""
[33] "" "" "" ""
[37] "" "" "" ""
[41] "" "ER" "to" "Golgi"
[45] "vesicle-mediated" "transport" [[19]][1] "" "" "19" "" "" "204379_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "0.0594" "" "" "" ""
[25] "" "" "" "" "" "FGFR3"
[31] "" "" "" "" "" ""
[37] "" "" "" "" "MAPKKK" "cascade" [[20]][1] "" "" "20" "" "" "242180_at" [7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "" "-0.0585" "" "" ""
[25] "" "" "" "" "" ""
[31] "TSPAN16" "" "" "" "" ""
[37] "" "" "" "cellular" "activation" "and"
[43] "adhesion" [[21]][1] "" "" "21" "" "" [6] "216473_x_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "-0.0576" ""
[21] "" "" "" "" ""
[26] "" "" "" "DUX4" ""
[31] "" "" "" "" ""
[36] "" "" "" "" ""
[41] "regulation" "of" "transcription," "DNA-dependent" [[22]][1] "" "" "22" "" "" "209683_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "-0.0561"
[22] "" "" "" "" "" "" ""
[29] "" "" "FAM49A" "" "" "" ""
[36] "" "" "" "" "" "---" [[23]][1] "" "" "23" "" "" "219550_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" ""
[22] "0.0559" "" "" "" "" "" ""
[29] "" "" "" "ROBO3" "" "" ""
[36] "" "" "" "" "" "" ""
[43] "axon" "guidance" [[24]][1] "" "" "24" "" "" [6] "223811_s_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" "0.0556"
[21] "" "" "" "" ""
[26] "" "" "" "" "SUN1"
[31] "/" "GET4" "" "" ""
[36] "" "cytoskeletal" "anchoring" "at" "nuclear"
[41] "membrane" [[25]][1] "" "" "25" "" "" [6] "202813_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "0.0548" "" "" ""
[26] "" "" "" "" ""
[31] "" "TARBP1" "" "" ""
[36] "" "" "" "" ""
[41] "" "regulation" "of" "transcription" "from"
[46] "RNA" "polymerase" "II" "promoter" [[26]][1] "" "" "26" "" "" "212282_at" [7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "" "" "0.0530" "" ""
[25] "" "" "" "" "" ""
[31] "" "TMEM97" "" "" "" ""
[37] "" "" "" "" "" "cholesterol"
[43] "homeostasis"[[27]][1] "" "" "27" "" "" "238780_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0529" "" "" "" "" ""
[25] "" "" "" "" "EST/" "BX647543"
[31] "" "" "---" [[28]][1] "" "" "28" "" "" [6] "M97935_MA_at2" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "0.0525" "" ""
[21] "" "" "" "" ""
[26] "" "" "STAT1" "" ""
[31] "" "" "" "" ""
[36] "" "" "" "transcription" "from"
[41] "RNA" "polymerase" "II" "promoter" [[29]][1] "" "" "29" "" "" "221041_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0520" "" "" "" "" ""
[25] "" "" "" "" "SLC17A5" ""
[31] "" "" "" "" "" ""
[37] "" "anion" "transport" [[30]][1] "" "" "30" "" "" "224009_x_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0520" "" "" "" "" ""
[25] "" "" "" "" "DHRS9" ""
[31] "" "" "" "" "" ""
[37] "" "" "" "androgen" "metabolic" "process" [[31]][1] "" "" "31" "" "" "214612_x_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "0.0496" "" "" "" ""
[25] "" "" "" "" "" "MAGEA6"
[31] "" "" "" "" "" ""
[37] "" "" "" "---" [[32]][1] "" "" "32" "" "" "208232_x_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0493" "" "" "" "" ""
[25] "" "" "" "" "---" ""
[31] "" "" "" "" "" ""
[37] "" "" "" "" "" "---" [[33]][1] "" "" "33" "" "" "238662_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" ""
[22] "0.0490" "" "" "" "" "" ""
[29] "" "" "" "ATPBD4" "" "" ""
[36] "" "" "" "" "" "" "---" [[34]][1] "" "" "34" "" "" [6] "206204_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "0.0477" "" "" ""
[26] "" "" "" "" ""
[31] "" "GRB14" "" "" ""
[36] "" "" "" "" ""
[41] "" "" "signal" "transduction"[[35]][1] "" "" "35" "" "" "233437_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" ""
[22] "0.0446" "" "" "" "" "" ""
[29] "" "" "" "GABRA4" "" "" ""
[36] "" "" "" "" "" "" "transport"[[36]][1] "" "" "36" "" "" "200875_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "0.0437" "" "" "" ""
[25] "" "" "" "" "" "NOP56"
[31] "" "" "" "" "" ""
[37] "" "" "" "" "rRNA" "processing" [[37]][1] "" "" "37" "" "" "38158_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" ""
[22] "" "0.0423" "" "" "" "" ""
[29] "" "" "" "" "ESPL1" "" ""
[36] "" "" "" "" "" "" ""
[43] "" "apoptosis"[[38]][1] "" "" "38" "" "" "217548_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "-0.0423"
[22] "" "" "" "" "" "" ""
[29] "" "" "C15orf38" "" "" "" ""
[36] "" "" "" "---" [[39]][1] "" "" "39" "" "" "220351_at" [7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "" "" "0.0420" "" ""
[25] "" "" "" "" "" ""
[31] "" "CCRL1" "" "" "" ""
[37] "" "" "" "" "" ""
[43] "chemotaxis"[[40]][1] "" "" "40" "" "" [6] "213002_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "-0.0418" "" "" "" ""
[26] "" "" "" "" ""
[31] "MARCKS" "" "" "" ""
[36] "" "" "" "" ""
[41] "actin" "filament" "crosslinking"[[41]][1] "" "" "41" "" "" [6] "243018_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "0.0407" "" "" ""
[26] "" "" "" "" ""
[31] "" "EST/BE568408" "" "" ""
[36] "---" [[42]][1] "" "" "42" "" "" "221755_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" ""
[22] "0.0396" "" "" "" "" "" ""
[29] "" "" "" "EHBP1L1" "" "" ""
[36] "" "" "" "" "" "---" [[43]][1] "" "" "43" "" "" "208667_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0390" "" "" "" "" ""
[25] "" "" "" "" "ST13" ""
[31] "" "" "" "" "" ""
[37] "" "" "" "" "protein" "folding" [[44]][1] "" "" "44" "" "" [6] "212055_at" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "0.0384" "" "" ""
[26] "" "" "" "" ""
[31] "" "C18orf10" "" "" ""
[36] "" "" "" "" "cytoskeleton"[[45]][1] "" "" "45" "" "" "201292_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "-0.0372"
[22] "" "" "" "" "" "" ""
[29] "" "" "TOP2A" "" "" "" ""
[36] "" "" "" "" "" "" "DNA"
[43] "ligation" [[46]][1] "" "" "46" "" "" "201102_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "" "0.0349" "" "" "" ""
[25] "" "" "" "" "" "PFKL"
[31] "" "" "" "" "" ""
[37] "" "" "" "" "" "fructose"
[43] "6-phosphate" "metabolic" "process" [[47]][1] "" "" "47" "" "" "214150_x_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0349" "" "" "" "" ""
[25] "" "" "" "" "ATP6V0E1" ""
[31] "" "" "" "" "" ""
[37] "proton" "transport" [[48]][1] "" "" "48" "" "" "226742_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "-0.0345"
[22] "" "" "" "" "" "" ""
[29] "" "" "SAR1B" "" "" "" ""
[36] "" "" "" "" "" "" "transport"[[49]][1] "" "" "49" "" "" "215181_at" "" [8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "-0.0342"
[22] "" "" "" "" "" "" ""
[29] "" "" "CDH22" "" "" "" ""
[36] "" "" "" "" "" "" "cell"
[43] "adhesion" [[50]][1] "" "" "50" "" "" "208904_s_at"[7] "" "" "" "" "" ""
[13] "" "" "" "" "" ""
[19] "-0.0334" "" "" "" "" ""
[25] "" "" "" "" "RPS28" ""
[31] "" "" "" "" "" ""
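[37] "" "" "" "rRNA" "processing"
> # Now just drop the "" entries and pull out the gene symbol. A for loop or lapply() both work; the idea is the same
> # b2 is the list produced by the str_split(" ") call above
> # gene <- character(0)
> # for (i in 1:50) {
> #   gene_name <- b2 %>% .[[i]] %>% .[. != ""] %>% .[4] # the gene symbol is the 4th field; adjust for your own data
> #   gene <- c(gene, gene_name)
> # }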
> b[10] %>% str_split("\n") %>% .[[1]] %>% .[-c(1:7)] %>% .[-c(51:53)] %>% str_split(" ") %>% lapply(\(x){x[x!=""]%>%.[4]})
[[1]]
[1] "LTBP1"
[[2]]
[1] "SFMBT1"
[[3]]
[1] "SEC62"
[[4]]
[1] "C1S"
[[5]]
[1] "AIMP1"
[[6]]
[1] "ZBTB25"
[[7]]
[1] "ACVR2A"
[[8]]
[1] "S100A6"
[[9]]
[1] "ITGA6"
[[10]]
[1] "HMGB3"
[[11]]
[1] "BCS1L"
[[12]]
[1] "LOC100271836"
[[13]]
[1] "PPP2R1B"
[[14]]
[1] "PCDHB7"
[[15]]
[1] "DYNLRB2"
[[16]]
[1] "IL7R"
[[17]]
[1] "DNAJB9"
[[18]]
[1] "RAB2A"
[[19]]
[1] "FGFR3"
[[20]]
[1] "TSPAN16"
[[21]]
[1] "DUX4"
[[22]]
[1] "FAM49A"
[[23]]
[1] "ROBO3"
[[24]]
[1] "SUN1"
[[25]]
[1] "TARBP1"
[[26]]
[1] "TMEM97"
[[27]]
[1] "EST/"
[[28]]
[1] "STAT1"
[[29]]
[1] "SLC17A5"
[[30]]
[1] "DHRS9"
[[31]]
[1] "MAGEA6"
[[32]]
[1] "---"
[[33]]
[1] "ATPBD4"
[[34]]
[1] "GRB14"
[[35]]
[1] "GABRA4"
[[36]]
[1] "NOP56"
[[37]]
[1] "ESPL1"
[[38]]
[1] "C15orf38"
[[39]]
[1] "CCRL1"
[[40]]
[1] "MARCKS"
[[41]]
[1] "EST/BE568408"
[[42]]
[1] "EHBP1L1"
[[43]]
[1] "ST13"
[[44]]
[1] "C18orf10"
[[45]]
[1] "TOP2A"
[[46]]
[1] "PFKL"
[[47]]
[1] "ATP6V0E1"
[[48]]
[1] "SAR1B"
[[49]]
[1] "CDH22"
[[50]]
[1] "RPS28"
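Note that rows 24 and 27 come out as "SUN1" and "EST/", because splitting on a single space cuts symbols such as "SUN1 / GET4" and "EST/ BX647543" into pieces. One possible workaround, sketched below in base R, is to split on runs of two or more spaces, so that a single space inside a field is kept. The sample `rows` and their exact spacing are made up to mimic the table layout, and the column names are hypothetical labels chosen for this illustration.

```r
# Made-up rows mimicking the table layout: columns are separated by
# several spaces, while a single space may occur inside a field.
rows <- c(
  "  24   223811_s_at              0.0556          SUN1 / GET4     cytoskeletal anchoring at nuclear membrane",
  "  27   238780_s_at             -0.0529          EST/ BX647543   ---",
  "  32   208232_x_at             -0.0493          ---             ---",
  "   1   202728_s_at             -0.1105          LTBP1           negative regulation of TGFbeta receptor signaling"
)

# Trim leading/trailing whitespace, then split on runs of 2+ spaces.
fields <- strsplit(trimws(rows), " {2,}")

# The gene symbol is the 4th field of every row.
symbols <- vapply(fields, function(x) x[4], character(1))
print(symbols)

# Every row has 5 fields here, so the rows can be bound into a table.
tbl <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(tbl) <- c("rank", "probe", "beta", "symbol", "description")
```

vapply() returns a plain character vector rather than a list, which is usually easier to work with afterwards. This relies on column gaps always being at least two spaces wide, which should be checked against the actual page.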
pdf_data()
pdf_data() returns each page of the PDF as a data frame.
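As a rough idea of what such a data frame allows, the sketch below rebuilds text lines from word positions in base R. The `words` data frame is made up; it assumes the per-page data frame has `x`, `y` and `text` columns (one row per word), which is how I understand the pdftools documentation. Real pages may place words of the same line at slightly different `y` values, so treat this as an illustration only.

```r
# Made-up stand-in for one page of pdf_data() output: one row per word,
# with its position on the page (x = horizontal, y = vertical).
words <- data.frame(
  x    = c(120, 40, 80, 40, 80, 120),
  y    = c(100, 100, 100, 112, 112, 112),
  text = c("LTBP1", "1", "202728_s_at", "2", "239054_at", "SFMBT1"),
  stringsAsFactors = FALSE
)

# Sort top to bottom, then left to right within each line.
words <- words[order(words$y, words$x), ]

# Words sharing the same y coordinate are pasted back into one line.
rebuilt <- tapply(words$text, words$y, paste, collapse = " ")
rebuilt <- unname(as.vector(rebuilt))
print(rebuilt)
```

Because the word positions are known, columns could also be assigned by `x` ranges instead of by counting spaces.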
pdf_render_page()
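pdf_render_page() renders a page into a raw bitmap array for further processing in R.
pdf_convert()
pdf_convert() converts PDF page(s) to png, jpeg or tiff format at high quality.
I have not used these functions yet; I will come back and write them up once I have.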