How to print row(s) if they meet a certain range
I have two mega files that look like below:
f1:
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
f2:
chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025
What I want to do is to print the rows of file 2 in a separate column in file 1 IF the range of start
and end
column of file2 is inside the ranges in file1 (columns 2 and 3) and chr
is the same. So based on the dummy example files I provided - the desired output should be (only the range of linc1320
is in the first row of the file1):
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
I am not a professional coder but I have been using this code to manually change the ranges based on the file2:
awk -F ',' '$2<=3073300,$3>=3074300, {print $1,$2,$3,$4,$5,$6,$7}' f1.csv
I do not have a particular preference for using a specific programming language - both Python
and awk
would be very helpful. Any help is appreciated thank you.
-
2Are you okay with using the pandas library in python or you want the script in core python itself? – Shubham Sharma 2 days ago
-
1I would like to also have the solution with pandas as well - thank you. – Apex 2 days ago